Abstract



We show a principled way of deriving online learning algorithms from a minimax analysis. Various 
upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield 
algorithms. This allows us to seamlessly recover known methods and to derive new ones. Our framework 
also captures such "unorthodox" methods as Follow the Perturbed Leader and the $R^2$ forecaster. We
emphasize that understanding the inherent complexity of the learning problem leads to the development 
of algorithms. 

We define local sequential Rademacher complexities and associated algorithms that allow us to obtain 
faster rates in online learning, similarly to statistical learning theory. Based on these localized complexities
we build a general adaptive method that can take advantage of the suboptimality of the observed
sequence. 

We present a number of new algorithms, including a family of randomized methods that use the idea of
a "random playout". Several new versions of the Follow-the-Perturbed-Leader algorithm are presented,
as well as methods based on Littlestone's dimension, efficient methods for matrix completion with
trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 



1 Introduction 

This paper studies the online learning framework, where the goal of the player is to incur small regret while 
observing a sequence of data on which we place no distributional assumptions. Within this framework, many 
algorithms have been developed over the past two decades, and we refer to the book of Cesa-Bianchi and 
Lugosi [7] for a comprehensive treatment of the subject. More recently, a non-algorithmic minimax approach 
has been developed to study the inherent complexities of sequential problems [2, 1, 14, 19]. In particular, 
it was shown that a theory in parallel to Statistical Learning can be developed, with random averages, 
combinatorial parameters, covering numbers, and other measures of complexity. Just as the classical learning 
theory is concerned with the study of the supremum of empirical or Rademacher process, online learning 
is concerned with the study of the supremum of a martingale or a certain dyadic process. Even though 
complexity tools introduced in [14, 16, 15] provide ways of studying the minimax value, no algorithms have 
been exhibited to achieve these non-constructive bounds in general. 

In this paper, we show that algorithms can, in fact, be extracted from the minimax analysis. This observation
leads to a unifying view of many of the methods known in the literature, and also gives a general
recipe for developing new algorithms. We show that the potential method, which has been studied in various 
forms, naturally arises from the study of the minimax value as a certain relaxation. We further show that 
the sequential complexity tools introduced in [i-.] are, in fact, relaxations and can be used for constructing 
algorithms that enjoy the corresponding bounds. By choosing appropriate relaxations, we recover many 
known methods, improved variants of some known methods, and new algorithms. One can view our framework
as one for converting a non-constructive proof of an upper bound on the value of the game into an
algorithm. Surprisingly, this allows us to also study such "unorthodox" methods as Follow the Perturbed
Leader [10], and the recent method of [>] under the same umbrella with others. We show that the idea of a
random playout has a solid theoretical basis, and that Follow the Perturbed Leader algorithm is an example 
of such a method. It turns out that whenever the sequential Rademacher complexity is of the same order as 
its i.i.d. cousin, there is a family of randomized methods that avoid certain computational hurdles. Based 
on these developments, we exhibit an efficient method for the trace norm matrix completion problem, novel 
Follow the Perturbed Leader algorithms, and efficient methods for the problems of transductive learning and 
prediction with static experts. 

The framework of this paper gives a recipe for developing algorithms. Throughout the paper, we stress 
that the notion of a relaxation, introduced below, does not appear out of thin air but rather arises as an upper
bound on the sequential Rademacher complexity. The understanding of inherent complexity thus leads to
the development of algorithms. 

One unsatisfying aspect of the minimax developments so far has been the lack of a localized analysis. Local 
Rademacher averages have been shown to play a key role in Statistical Learning for obtaining fast rates. It 
is also well known that fast rates are possible in online learning on a case-by-case basis, such as for online
optimization of strongly convex functions. We show that, in fact, a localized analysis can be performed at an 
abstract level, and it goes hand-in-hand with the idea of relaxations. Using such localized analysis, we arrive 
at local sequential Rademacher and other local complexities. These complexities upper-bound the value of 
the online learning game and can lead to fast rates. What is equally important, we provide an associated 
generic algorithm to achieve the localized bounds. We further develop the ideas of localization, presenting 
a general adaptive (data-dependent) procedure that takes advantage of the actual moves of the adversary 
that might have been suboptimal. We illustrate the procedure on a few examples. Our study of localized 
complexities and adaptive methods follows from a general agenda of developing universal methods that can 
adapt to the actual sequence of data played by Nature, thus automatically interpolating between benign and 
minimax optimal sequences. 

This paper is organized as follows. In Section 2 we formulate the value of the online learning problem and 
present the (possibly computationally inefficient) minimax algorithm. In Section 3 we develop the idea of 
relaxations and the meta algorithm based on relaxations, and present a few examples. Section 4 is devoted 
to a new formalism of localized complexities, and we present a basic localized meta algorithm. We show, 
in particular, that for strongly convex objectives, the regret is easily bounded through localization. Next, 
in Section 5, we present a fully adaptive method that constantly checks whether the sequence being played 
by the adversary is in fact minimax optimal. We show that, in particular, we recover some of the known 
adaptive results. We also demonstrate how local data-dependent norms arise as a natural adaptive method. 
The remaining sections present a number of new algorithms, often with better computational properties
and regret guarantees than those known in the literature.

Notation: A set $\{x_1, \ldots, x_t\}$ is often denoted by $x_{1:t}$. A $t$-fold product of $\mathcal{X}$ is denoted by $\mathcal{X}^t$. Expectation
with respect to a random variable $Z$ with distribution $p$ is denoted by $\mathbb{E}_Z$ or $\mathbb{E}_{Z \sim p}$. The set $\{1, \ldots, T\}$ is
denoted by $[T]$, and the set of all distributions on some set $\mathcal{A}$ by $\Delta(\mathcal{A})$. The inner product between two
vectors is written as $\langle a, b \rangle$ or as $a^\top b$. The set of all functions from $\mathcal{X}$ to $\mathcal{Y}$ is denoted by $\mathcal{Y}^{\mathcal{X}}$. Unless specified
otherwise, $\epsilon$ denotes a vector $(\epsilon_1, \ldots, \epsilon_T)$ of i.i.d. Rademacher random variables. An $\mathcal{X}$-valued tree $\mathbf{x}$ of
depth $d$ is defined as a sequence $(\mathbf{x}_1, \ldots, \mathbf{x}_d)$ of mappings $\mathbf{x}_t : \{\pm 1\}^{t-1} \to \mathcal{X}$ (see [14]). We often write $\mathbf{x}_t(\epsilon)$
instead of $\mathbf{x}_t(\epsilon_{1:t-1})$.

2 Value and The Minimax Algorithm 

Let $\mathcal{F}$ be the set of learner's moves and $\mathcal{X}$ the set of moves of Nature. The online protocol dictates that
on every round $t = 1, \ldots, T$ the learner and Nature simultaneously choose $f_t \in \mathcal{F}$, $x_t \in \mathcal{X}$, and observe each
other's actions. The learner aims to minimize regret

$$\mathbf{Reg}_T = \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t),$$

where $\ell : \mathcal{F} \times \mathcal{X} \to \mathbb{R}$ is a known loss function. Our aim is to study this online learning problem at an
abstract level without assuming convexity or other properties of the loss function and the sets $\mathcal{F}$ and $\mathcal{X}$.
We do assume, however, that $\ell$, $\mathcal{F}$, and $\mathcal{X}$ are such that the minimax theorem in the space of distributions
over $\mathcal{F}$ and $\mathcal{X}$ holds. By studying the abstract setting, we are able to develop general algorithmic and
non-algorithmic ideas that are common across various application areas.



The starting point of our development is the minimax value of the associated online learning game: 



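$$\mathcal{V}_T(\mathcal{F}) = \inf_{q_1 \in \Delta(\mathcal{F})} \sup_{x_1 \in \mathcal{X}} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T \in \Delta(\mathcal{F})} \sup_{x_T \in \mathcal{X}} \mathbb{E}_{f_T \sim q_T} \left[ \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t) \right] \tag{1}$$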



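where $\Delta(\mathcal{F})$ is the set of distributions on $\mathcal{F}$. The minimax formulation immediately gives rise to the optimal
algorithm that solves the minimax expression at every round $t$. That is, after witnessing $x_1, \ldots, x_{t-1}$ and
$f_1, \ldots, f_{t-1}$, the algorithm returns

$$\begin{aligned}
&\operatorname*{argmin}_{q \in \Delta(\mathcal{F})} \left\{ \sup_{x_t} \mathbb{E}_{f_t \sim q} \inf_{q_{t+1}} \sup_{x_{t+1}} \mathbb{E}_{f_{t+1}} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T} \left[ \sum_{i=t}^{T} \ell(f_i, x_i) - \inf_{f \in \mathcal{F}} \sum_{i=1}^{T} \ell(f, x_i) \right] \right\} \\
&= \operatorname*{argmin}_{q \in \Delta(\mathcal{F})} \left\{ \sup_{x_t} \mathbb{E}_{f_t \sim q} \left[ \ell(f_t, x_t) + \inf_{q_{t+1}} \sup_{x_{t+1}} \mathbb{E}_{f_{t+1}} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T} \left[ \sum_{i=t+1}^{T} \ell(f_i, x_i) - \inf_{f \in \mathcal{F}} \sum_{i=1}^{T} \ell(f, x_i) \right] \right] \right\}
\end{aligned} \tag{2}$$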



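Henceforth, if the quantification in inf and sup is omitted, it will be understood that $x_t$, $f_t$, $p_t$, $q_t$ range over
$\mathcal{X}$, $\mathcal{F}$, $\Delta(\mathcal{X})$, $\Delta(\mathcal{F})$, respectively. Moreover, $\mathbb{E}_{x_t}$ is with respect to $p_t$ while $\mathbb{E}_{f_t}$ is with respect to $q_t$. The
first sum in (2) starts at $i = t$ since the partial loss $\sum_{i=1}^{t-1} \ell(f_i, x_i)$ has been fixed. We now notice a recursive
form for defining the value of the game. Define for any $t \in [T-1]$ and any given prefix $x_1, \ldots, x_t \in \mathcal{X}$ the
conditional value

$$\mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \inf_{q \in \Delta(\mathcal{F})} \sup_{x \in \mathcal{X}} \left\{ \mathbb{E}_{f \sim q} [\ell(f, x)] + \mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_t, x) \right\}$$

where

$$\mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_T) = - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t) \quad \text{and} \quad \mathcal{V}_T(\mathcal{F}) = \mathcal{V}_T(\mathcal{F} \mid \{\}).$$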

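The minimax optimal algorithm specifying the mixed strategy of the player can be written succinctly as

$$q_t = \operatorname*{argmin}_{q \in \Delta(\mathcal{F})} \sup_{x \in \mathcal{X}} \left\{ \mathbb{E}_{f \sim q} [\ell(f, x)] + \mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_{t-1}, x) \right\}. \tag{3}$$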



This recursive formulation has appeared in the literature, but now we have tools to study the conditional value
of the game. We will show that various upper bounds on $\mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_{t-1}, x)$ yield an array of algorithms,
some with better computational properties than others. In this way, the non-constructive approach of
[14, 15, 16] to upper bound the value of the game directly translates into algorithms.

The minimax algorithm in (3) can be interpreted as choosing the best decision that takes into account the 
present loss and the worst-case future. We then realize that the conditional value of the game serves as a
"regularizer", and thus well-known online learning algorithms such as Exponential Weights, Mirror Descent
and Follow-the-Regularized-Leader arise as relaxations rather than as methods that "just work".
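The recursion behind (3) can be made concrete on a toy game. The sketch below is illustrative only: the game (two constant predictions, one bit from Nature, 0-1 loss, horizon 3), the names `conditional_value` and `minimax_strategy`, and the use of ternary search over the learner's mixing probability are all assumptions made for this example, not constructions from the paper. The search relies on the objective being a maximum of affine functions of the mixing probability, hence convex.

```python
from functools import lru_cache

# Hypothetical toy game (not from the paper): the learner picks one of two
# constant predictions, Nature picks a bit, and the loss is the 0-1 loss.
LEARNER_MOVES = (0, 1)
NATURE_MOVES = (0, 1)
HORIZON = 3


def loss(f, x):
    return float(f != x)


@lru_cache(maxsize=None)
def conditional_value(prefix):
    """Conditional value of the toy game given a prefix, by backward induction."""
    if len(prefix) == HORIZON:
        # Terminal case: minus the cumulative loss of the best fixed move.
        return -min(sum(loss(f, x) for x in prefix) for f in LEARNER_MOVES)
    return minimize_over_mixtures(prefix)[1]


def worst_case(prefix, p):
    """sup over x of expected loss plus conditional value, for q = (1 - p, p)."""
    return max(
        (1 - p) * loss(0, x) + p * loss(1, x) + conditional_value(prefix + (x,))
        for x in NATURE_MOVES
    )


def minimize_over_mixtures(prefix, iters=200):
    """Ternary search over p = q(1); the objective is convex in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if worst_case(prefix, m1) <= worst_case(prefix, m2):
            hi = m2
        else:
            lo = m1
    p = (lo + hi) / 2
    return p, worst_case(prefix, p)


def minimax_strategy(history):
    """Mixed strategy of (3), returned as the probability of playing move 1."""
    return minimize_over_mixtures(tuple(history))[0]
```

On this toy game the value at the empty history works out to 3/4, and after observing a single 0 the learner hedges by predicting 1 with probability 1/4. The cost of the recursion grows exponentially with the horizon, which is the computational obstacle that motivates replacing the conditional value with a more tractable upper bound.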

The first step is to appeal to the minimax theorem and perform the same manipulation as in [1, 14], but 
only on the value from t + 1 onwards: 



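$$\mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \sup_{p_{t+1}} \mathbb{E}_{x_{t+1}} \cdots \sup_{p_T} \mathbb{E}_{x_T} \left[ \sum_{i=t+1}^{T} \inf_{f_i \in \mathcal{F}} \mathbb{E}_{x_i \sim p_i} \ell(f_i, x_i) - \inf_{f \in \mathcal{F}} \sum_{i=1}^{T} \ell(f, x_i) \right].$$

This expression is still unwieldy, and the idea is now to come up with more manageable, yet tight, upper
bounds on the conditional value.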



3 Relaxations and the Basic Meta-Algorithm

A relaxation $\mathbf{Rel}_T$ is a sequence of functions $\mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t)$ for each $t \in [T]$. We shall use the notation
$\mathbf{Rel}_T(\mathcal{F})$ for $\mathbf{Rel}_T(\mathcal{F} \mid \{\})$. A relaxation will be called admissible if for any $x_1, \ldots, x_t \in \mathcal{X}$,

$$\mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) \ge \inf_{q \in \Delta(\mathcal{F})} \sup_{x \in \mathcal{X}} \left\{ \mathbb{E}_{f \sim q} [\ell(f, x)] + \mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t, x) \right\} \tag{4}$$

for all $t \in [T-1]$, and

$$\mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_T) \ge - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t).$$

A strategy $q$ that minimizes the expression in (4) defines an optimal algorithm for the relaxation $\mathbf{Rel}_T$.
This algorithm is given below under the name "Meta-Algorithm". However, minimization need not be exact:
any $q$ that satisfies the admissibility condition (4) is a valid method, and we will say that such an algorithm
is admissible with respect to the relaxation $\mathbf{Rel}_T$.



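Algorithm 1 Meta-Algorithm MetAlgo

Parameters: Admissible relaxation $\mathbf{Rel}$
for $t = 1$ to $T$ do
    $q_t = \operatorname*{argmin}_{q \in \Delta(\mathcal{F})} \sup_{x \in \mathcal{X}} \left\{ \mathbb{E}_{f \sim q} [\ell(f, x)] + \mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_{t-1}, x) \right\}$
    Play $f_t \sim q_t$ and receive $x_t$ from adversary
end for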





Proposition 1. Let $\mathbf{Rel}_T$ be an admissible relaxation. For any admissible algorithm with respect to $\mathbf{Rel}_T$,
including the Meta-Algorithm, irrespective of the strategy of the adversary,

$$\sum_{t=1}^{T} \mathbb{E}_{f_t \sim q_t} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t) \le \mathbf{Rel}_T(\mathcal{F}), \tag{5}$$

and therefore,

$$\mathbb{E}[\mathbf{Reg}_T] \le \mathbf{Rel}_T(\mathcal{F}).$$

We also have that

$$\mathcal{V}_T(\mathcal{F}) \le \mathbf{Rel}_T(\mathcal{F}).$$

If $a \le \ell(f, x) \le b$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, the Hoeffding-Azuma inequality yields, with probability at least $1 - \delta$,

$$\mathbf{Reg}_T = \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t) \le \mathbf{Rel}_T(\mathcal{F}) + (b - a) \sqrt{T/2 \cdot \log(2/\delta)}.$$

Further, if for all $t \in [T]$, the admissible strategies $q_t$ are deterministic,

$$\mathbf{Reg}_T \le \mathbf{Rel}_T(\mathcal{F}).$$

The reader might recognize $\mathbf{Rel}$ as a potential function. It is known that one can derive regret bounds
by coming up with a potential such that the current loss of the player is related to the difference in the 
potentials at successive steps, and that the loss of the best decision in hindsight can be extracted from the 
final potential. The origin of "good" potential functions has always been a mystery (at least to the authors). 
One of the conceptual contributions of this paper is to show that they naturally arise as relaxations on the 
conditional value. The conditional value itself can be characterized as the tightest possible relaxation. 

In particular, for many problems a tight relaxation (sometimes within a factor of 2) is achieved through 
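symmetrization. Define the conditional Sequential Rademacher complexity

$$\mathfrak{R}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{f \in \mathcal{F}} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \ell(f, \mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - \sum_{s=1}^{t} \ell(f, x_s) \right]. \tag{6}$$

Here the supremum is over all $\mathcal{X}$-valued binary trees of depth $T - t$. One may view this complexity as a
partially symmetrized version of the sequential Rademacher complexity

$$\mathfrak{R}_T(\mathcal{F}) = \mathfrak{R}_T(\mathcal{F} \mid \{\}) = \sup_{\mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \left[ 2 \sum_{s=1}^{T} \epsilon_s \ell(f, \mathbf{x}_s(\epsilon_{1:s-1})) \right] \tag{7}$$

defined in [14]. We shall refer to the term involving the tree $\mathbf{x}$ as the "future" and to the term being subtracted
off as the "past". This indeed corresponds to the fact that the quantity is conditioned on the already
observed $x_1, \ldots, x_t$, while for the future we have the worst possible binary tree.^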

Proposition 2. The conditional Sequential Rademacher complexity is admissible. 



The proof of this proposition is given in the Appendix and it corresponds to one step of the sequential 
symmetrization proof in [14]. We note that the factor 2 appearing in (6) is not necessary in certain cases 
(e.g. binary prediction with absolute loss). 

We now show that several well-known methods arise as further relaxations on the conditional sequential
Rademacher complexity $\mathfrak{R}_T$.



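Exponential Weights Suppose $\mathcal{F}$ is a finite class and $|\ell(f, x)| \le 1$. In this case, a (tight) upper bound on
sequential Rademacher complexity leads to the following relaxation:

$$\mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \inf_{\lambda > 0} \left\{ \frac{1}{\lambda} \log \left( \sum_{f \in \mathcal{F}} \exp \left( -\lambda \sum_{i=1}^{t} \ell(f, x_i) \right) \right) + 2 \lambda (T - t) \right\} \tag{8}$$

Proposition 3. The relaxation (8) is admissible and

$$\mathfrak{R}_T(\mathcal{F} \mid x_1, \ldots, x_t) \le \mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t).$$

Furthermore, it leads to a parameter-free version of the Exponential Weights algorithm, defined on round
$t + 1$ by the mixed strategy

$$q_{t+1}(f) \propto \exp \left( -\lambda_t^* \sum_{s=1}^{t} \ell(f, x_s) \right)$$

with $\lambda_t^*$ the optimal value in (8). The algorithm's regret is bounded by

$$\mathbf{Rel}_T(\mathcal{F}) \le 2 \sqrt{2 T \log |\mathcal{F}|}.$$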
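A minimal sketch of the parameter-free Exponential Weights strategy may help fix ideas. It assumes a finite class given as a list of cumulative losses with per-round losses in $[-1, 1]$, and it replaces the exact one-dimensional search over $\lambda$ in (8) with a search over a fixed logarithmic grid (`LAMBDA_GRID`, a hypothetical choice made here for simplicity). The function names are illustrative and not taken from the paper.

```python
import math


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def relaxation_at(cum_losses, rounds_left, lam):
    """Bracketed expression in (8) for one fixed value of lambda."""
    return log_sum_exp([-lam * L for L in cum_losses]) / lam + 2 * lam * rounds_left


# Hypothetical grid standing in for the exact line search over lambda.
LAMBDA_GRID = [10 ** (-3 + 4 * i / 399) for i in range(400)]


def best_lambda(cum_losses, rounds_left):
    return min(LAMBDA_GRID, key=lambda lam: relaxation_at(cum_losses, rounds_left, lam))


def exp_weights_strategy(cum_losses, rounds_left):
    """Distribution over the class, proportional to exp(-lambda* x cumulative loss)."""
    lam = best_lambda(cum_losses, rounds_left)
    logits = [-lam * L for L in cum_losses]
    z = log_sum_exp(logits)
    return [math.exp(v - z) for v in logits]


def run(loss_matrix):
    """loss_matrix[t][f] in [-1, 1]; returns (expected regret, relaxation at the empty history)."""
    T, n = len(loss_matrix), len(loss_matrix[0])
    cum = [0.0] * n
    rel0 = relaxation_at(cum, T, best_lambda(cum, T))
    expected_loss = 0.0
    for t, row in enumerate(loss_matrix):
        # With t observations so far, T - t rounds remain, matching the term 2*lambda*(T - t).
        q = exp_weights_strategy(cum, T - t)
        expected_loss += sum(qf * lf for qf, lf in zip(q, row))
        cum = [c + l for c, l in zip(cum, row)]
    return expected_loss - min(cum), rel0
```

Restricting $\lambda$ to a grid only enlarges the infimum, so the grid value at the empty history is slightly above $2\sqrt{2T\log|\mathcal{F}|}$; with a fine grid the gap is negligible. The learning rate is recomputed on every round from the observed cumulative losses, so no schedule has to be fixed in advance.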



^It is somewhat cumbersome to write out the indices on $\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})$ in (6), so we will instead use $\mathbf{x}_s(\epsilon)$ for $s = 1, \ldots, T-t$,
whenever this does not cause confusion.
The Chernoff-Cramer inequality tells us that (8) is the tightest possible relaxation. The proof of Proposition 3
reveals that the only inequality is the softmax which is also present in the proof of the maximal inequality 
for a finite collection of random variables. In this way, exponential weights is an algorithmic realization of 
a maximal inequality for a finite collection of random variables. The connection between probabilistic (or 
concentration) inequalities and algorithms runs much deeper. 

We point out that the exponential-weights algorithm arising from the relaxation (8) is a parameter-free 
algorithm. The learning rate $\lambda^*$ can be optimized (via one-dimensional line search) at each iteration with
almost no cost. This can lead to improved performance as compared to the classical methods that set a 
particular schedule for the learning rate. 

Mirror Descent In the setting of online linear optimization, the loss is $\ell(f, x) = \langle f, x \rangle$. Suppose $\mathcal{F}$ is
a unit ball in some Banach space and $\mathcal{X}$ is the dual. Let $\| \cdot \|$ be some $(2, C)$-smooth norm on $\mathcal{X}$ (in the
Euclidean case, $C = 2$). Using the notation $\tilde{x}_{t-1} = \sum_{s=1}^{t-1} x_s$, a straightforward upper bound on sequential
Rademacher complexity is the following relaxation:

$$\mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \sqrt{ \| \tilde{x}_{t-1} \|^2 + \langle \nabla \| \tilde{x}_{t-1} \|^2, x_t \rangle + C (T - t + 1) } \tag{9}$$

Proposition 4. The relaxation (9) is admissible and

$$\mathfrak{R}_T(\mathcal{F} \mid x_1, \ldots, x_t) \le \mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t).$$

Furthermore, it leads to the Mirror Descent algorithm with regret at most $\mathbf{Rel}_T(\mathcal{F}) \le \sqrt{2CT}$.
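For intuition, here is a sketch of a step-size-free update in the Euclidean case. The closed form used in `md_move` (the negative running sum, rescaled by the square root appearing in (9)) is an assumption made for this illustration rather than a formula quoted from the text; a short telescoping argument shows that it keeps regret below $\sqrt{CT}$ whenever $\|x_t\| \le 1$ and $C \ge 1$, which is within the bound stated in Proposition 4.

```python
import math


def md_move(x_sum, t, T, C=2.0):
    """Move for round t (1-based), given x_sum = x_1 + ... + x_{t-1}."""
    scale = math.sqrt(sum(v * v for v in x_sum) + C * (T - t + 1))
    return [-v / scale for v in x_sum]


def run_md(xs, C=2.0):
    """Plays the rule on xs (vectors of Euclidean norm at most 1); returns the regret."""
    T, d = len(xs), len(xs[0])
    x_sum = [0.0] * d
    total = 0.0
    for t, x in enumerate(xs, start=1):
        f = md_move(x_sum, t, T, C)
        total += sum(fi * xi for fi, xi in zip(f, x))
        x_sum = [s + xi for s, xi in zip(x_sum, x)]
    # The best fixed f in the unit ball attains -||x_1 + ... + x_T||.
    return total + math.sqrt(sum(v * v for v in x_sum))
```

The denominator shrinks as the game approaches its end, so the move becomes more aggressive when fewer rounds remain; this plays the role that a decreasing step-size schedule plays in the usual presentation.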

An important feature of the algorithms we just proposed is the absence of any parameters, as the step size is 
tuned automatically. We had chosen Exponential Weights and Mirror Descent for illustration because these 
methods are well-known. Our aim at this point was to show that the associated relaxations arise naturally 
(typically with a few steps of algebra) from the sequential Rademacher complexity. More examples are 
included later in the paper. It should now be clear that upper bounds, such as the Dudley Entropy integral, 
can be turned into a relaxation, provided that admissibility is proved. Our ideas resemble those in
Statistics, where an information-theoretic complexity can be used for defining penalization methods.

4 Localized Complexities and the Localized-Meta Algorithm 

The localized analysis plays an important role in Statistical Learning Theory. The basic idea is that better 
rates can be proved for empirical risk minimization when one considers the empirical process in the vicinity 
of the target hypothesis [11, 4]. Through this, localization gives extra information by shrinking the size of the 
set which needs to be analyzed. What does it mean to localize in online learning? As we obtain more data, 
we can rule out parts of $\mathcal{F}$ as those that are unlikely to become the leaders. This observation indeed gives
rise to faster rates. Let us develop a general framework of localization and then illustrate it on examples.
We emphasize that the localization ideas will be developed at an abstract level where no assumptions are
placed on the loss function or the sets $\mathcal{F}$ and $\mathcal{X}$.

Given any $x_1, \ldots, x_t \in \mathcal{X}$, for any $k \ge 1$ define

$$\mathcal{F}^k(x_1, \ldots, x_t) = \left\{ f \in \mathcal{F} : \exists\, x_{t+1}, \ldots, x_{t+k} \in \mathcal{X} \ \text{s.t.} \ \sum_{i=1}^{t+k} \ell(f, x_i) = \inf_{f' \in \mathcal{F}} \sum_{i=1}^{t+k} \ell(f', x_i) \right\}.$$

That is, given the instances $x_1, \ldots, x_t$, the set $\mathcal{F}^k(x_1, \ldots, x_t)$ is the set of elements that could be the
minimizers of cumulative loss on $t + k$ instances, the first $t$ of which are $x_1, \ldots, x_t$ and the remaining $k$
arbitrary. We shall refer to minimizers of cumulative loss as empirical risk minimizers (or, ERM).
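When both the learner's class and Nature's move set are finite and small, the localized set can be enumerated directly. The following brute-force sketch is for illustration only; the function name and signature are hypothetical, and the enumeration is exponential in $k$.

```python
from itertools import product


def localized_class(learner_moves, nature_moves, loss, history, k):
    """Moves that minimize cumulative loss for at least one k-step continuation of history."""
    survivors = set()
    for future in product(nature_moves, repeat=k):
        sequence = list(history) + list(future)
        totals = {f: sum(loss(f, x) for x in sequence) for f in learner_moves}
        best = min(totals.values())
        # Small tolerance so that floating-point ties still count as minimizers.
        survivors.update(f for f, total in totals.items() if total <= best + 1e-12)
    return survivors
```

For example, with two constant predictors, 0-1 loss, and history (0, 0, 0), the predictor 1 cannot become a leader within one or two further rounds, so it enters the localized set only once $k = 3$ (where it can tie).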
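Importantly,

$$\mathcal{V}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \mathcal{V}_T(\mathcal{F}^{T-t}(x_1, \ldots, x_t) \mid x_1, \ldots, x_t).$$

Henceforth, we shall use the notation $\tilde{k}_j = \sum_{i=1}^{j} k_i$, with $\tilde{k}_0 = 0$. We now consider subdividing $T$ into blocks of time
$\tilde{k}_1, \ldots, \tilde{k}_m \in [T]$ such that $\tilde{k}_m = T$. With this notation, $\tilde{k}_i$ is the last time in the $i$th block. We then have
regret upper bounded as

$$\sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t) \le \sum_{t=1}^{T} \ell(f_t, x_t) - \sum_{i=1}^{m} \inf_{f \in \mathcal{F}^{k_i}(x_1, \ldots, x_{\tilde{k}_{i-1}})} \sum_{t=\tilde{k}_{i-1}+1}^{\tilde{k}_i} \ell(f, x_t). \tag{10}$$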

The short inductive proof is given in Appendix, Lemma 26. We can now bound (10) by 

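$$\sum_{i=1}^{m} \left( \sum_{t=\tilde{k}_{i-1}+1}^{\tilde{k}_i} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}^{k_i}(x_1, \ldots, x_{\tilde{k}_{i-1}})} \sum_{t=\tilde{k}_{i-1}+1}^{\tilde{k}_i} \ell(f, x_t) \right)$$

$$= \sum_{i=1}^{m} \mathbf{Reg}_{k_i}\left( x_{\tilde{k}_{i-1}+1}, \ldots, x_{\tilde{k}_i},\ f_{\tilde{k}_{i-1}+1}, \ldots, f_{\tilde{k}_i},\ \mathcal{F}^{k_i}(x_1, \ldots, x_{\tilde{k}_{i-1}}) \right),$$

where the $i$th summand is the regret within the $i$th block, measured against the localized class.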

Hence, one can decompose the online learning game into $m$ successive games, one per block. The crucial point
to notice is that in the $i$th block, we do not compete with the best hypothesis in all of $\mathcal{F}$ but rather only in
$\mathcal{F}^{k_i}(x_1, \ldots, x_{\tilde{k}_{i-1}})$. It is this localization based on history that could lead to possibly faster rates. While the
"blocking" idea often appears in the literature (for instance, in the form of a doubling trick, as described
below), the process is usually "restarted" from scratch by considering all of $\mathcal{F}$. Notice further that one need
not choose all $k_1, \ldots, k_m$ in advance. The player can choose $k_i$ based on history $x_1, \ldots, x_{\tilde{k}_{i-1}}$ and then use,
for instance, the Meta-Algorithm introduced in the previous section to play the game within the block $k_i$ using
the localized class $\mathcal{F}^{k_i}(x_1, \ldots, x_{\tilde{k}_{i-1}})$. Such adaptive procedures will be considered in Section 5, but presently
we assume that the block sizes $k_1, \ldots, k_m$ are fixed.

While the successive localizations using subsets $\mathcal{F}^{k_i}(x_1, \ldots, x_{\tilde{k}_{i-1}})$ can provide an algorithm with possibly
better performance, specifying and analyzing the localized subset $\mathcal{F}^{k_i}(x_1, \ldots, x_{\tilde{k}_{i-1}})$ exactly might not be
possible. In such a case, one can instead use

$$\mathcal{F}_r(x_1, \ldots, x_t) = \{ f \in \mathcal{F} : P(f \mid x_1, \ldots, x_t) \le r \}$$

where $P$ is some "property" of $f$ given data. This definition echoes the definition of the set of $r$-minimizers
of empirical or expected risk in Statistical Learning. Further, for a given $k$ define

$$r(k; x_1, \ldots, x_t) = \inf \{ r : \mathcal{F}^k(x_1, \ldots, x_t) \subseteq \mathcal{F}_r(x_1, \ldots, x_t) \},$$

the smallest "radius" such that $\mathcal{F}_r$ includes the set of potential minimizers over the next $k$ time steps. Of
course, if the property $P$ does not enforce localization, the bounds are not going to exhibit any improvement,
so $P$ needs to be chosen carefully for a particular problem of interest.

We have the following algorithm: 



Algorithm 2 Localized Meta-Algorithm
Parameters: Relaxation $\mathbf{Rel}$

Initialize $t = 0$ and blocks $k_1, \ldots, k_m$ s.t. $\sum_{i=1}^{m} k_i = T$
for $i = 1$ to $m$ do
    Play $k_i$ rounds using MetAlgo$\left( \mathcal{F}_{r(k_i; x_1, \ldots, x_t)} \right)$ and set $t = t + k_i$
end for

Lemma 5. The regret of the Localized Meta-Algorithm is bounded as

$$\mathbf{Reg}_T(x_1, \ldots, x_T) \le \sum_{i=1}^{m} \mathbf{Rel}_{k_i}\left( \mathcal{F}_{r(k_i; x_1, \ldots, x_{\tilde{k}_{i-1}})} \right).$$

Note that the above lemma points to local sequential complexities for online learning problems that can lead 
to possibly fast rates. In particular, if sequential Rademacher complexity is used as the relaxation in the 
Localized Meta-Algorithm, we get a bound in terms of local sequential Rademacher complexities. 

4.1 Local Sequential Complexities 

The following corollary is a direct consequence of Lemma 5. 

Corollary 6 (Local Sequential Rademacher Complexity). For any property $P$ and any $k_1, \ldots, k_m \in \mathbb{N}$ such
that $\sum_{i=1}^{m} k_i = T$, we have that:

$$\mathcal{V}_T(\mathcal{F}) \le \sup_{x_1, \ldots, x_T} \sum_{i=1}^{m} \mathfrak{R}_{k_i}\left( \mathcal{F}_{r(k_i; x_1, \ldots, x_{\tilde{k}_{i-1}})} \right).$$

Clearly, the sequential Rademacher complexities in the above bound can be replaced with other sequential
complexity measures of the localized classes that are upper bounds on the sequential Rademacher complexities.
For instance, one can replace each Rademacher complexity $\mathfrak{R}_{k_i}$ by covering number based bounds of
the local classes, such as the analogues of the Dudley Entropy Integral bounds developed in the sequential
setting in [ ]. One can also use, for instance, fat-shattering dimension based complexity measures for these
local classes.

4.2 Examples 

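4.2.1 Example: Doubling trick

The doubling trick can be seen as a particular blocking strategy with $k_i = 2^{i-1}$ so that

$$\mathbf{Reg}_T(x_1, \ldots, x_T) \le \sum_{i=1}^{\lceil \log_2 T \rceil + 1} \mathbf{Rel}_{2^{i-1}}\left( \mathcal{F}_{r(2^{i-1}; x_1, \ldots, x_{2^{i-1}-1})} \right) \le \sum_{i=1}^{\lceil \log_2 T \rceil + 1} \mathbf{Rel}_{2^{i-1}}(\mathcal{F})$$

for $\mathcal{F}_r$ defined with respect to some property $P$. The latter inequality is potentially loose, as the algorithm
is "restarted" after the previous block is completed. Now if $\mathbf{Rel}$ is such that for any $t$, $\mathbf{Rel}_t(\mathcal{F}) \le t^p$ for
some $p \in (0, 1]$, then the right-hand side is at most the geometric sum $\sum_i 2^{(i-1)p}$, so the regret is upper bounded by
a constant multiple of $T^p$, with the constant depending only on $p$. The main advantage of the doubling trick is of course
that we do not need to know $T$ in advance.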
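The blocking itself is easy to write down. In the sketch below, truncating the last block so that the lengths sum exactly to $T$ is a convenience assumed here, and `restart_bound` simply evaluates the sum of $k^p$ over the blocks, that is, the bound obtained when every block restarts on the whole class and each block's relaxation is at most $k^p$. Both function names are hypothetical.

```python
def doubling_blocks(T):
    """Block lengths 1, 2, 4, ... covering T rounds; the last block is truncated."""
    blocks, size, covered = [], 1, 0
    while covered < T:
        length = min(size, T - covered)
        blocks.append(length)
        covered += length
        size *= 2
    return blocks


def restart_bound(T, p):
    """Sum of k**p over the doubling blocks."""
    return sum(k ** p for k in doubling_blocks(T))
```

For $p = 1/2$ the sum stays within a constant factor of $\sqrt{T}$, in line with the geometric-series argument; localization can only improve on this, since each term is then evaluated on a subset of the class.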



4.2.2 Example: Strongly Convex Loss

To illustrate the idea of localization, consider online convex optimization with $\lambda$-strongly convex functions
$x_t : \mathcal{F} \to \mathbb{R}$ (that is, $\ell(f, x) = x(f)$). Define

$$\mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = - \inf_{f \in \mathcal{F}} \sum_{i=1}^{t} x_i(f) + (T - t) \inf_{f \in \mathcal{F}} \sup_{f' \in \mathcal{F}} \| f - f' \|.$$

An easy Lemma 27 in the Appendix shows that this relaxation is admissible. Notice that this relaxation grows
linearly with block size and is by itself quite bad. However, with blocking and localization, the relaxation
gives an optimal bound for strongly convex objectives. To see this, note that for $k = 1$, any minimizer of
$\sum_{i=1}^{t+1} x_i(f)$ has to be close to the minimizer $\hat{f}_t$ of $\sum_{i=1}^{t} x_i(f)$, due to strong convexity of the functions. In
other words, the property $P(f \mid x_1, \ldots, x_t) = \| f - \hat{f}_t \|$ with $r = 1/(\lambda t)$ entails

$$\mathcal{F}^1(x_1, \ldots, x_t) \subseteq \{ f \in \mathcal{F} : \| f - \hat{f}_t \| \le 1/(\lambda t) \} = \mathcal{F}_r(x_1, \ldots, x_t).$$

The relaxation for the block of size $k = 1$ is

$$\mathbf{Rel}_1(\mathcal{F}_r(x_1, \ldots, x_t)) \le \inf_{f \in \mathcal{F}_r(x_1, \ldots, x_t)} \sup_{f' \in \mathcal{F}_r(x_1, \ldots, x_t)} \| f - f' \|,$$

the radius of the smallest ball containing the localized set $\mathcal{F}_r(x_1, \ldots, x_t)$, and we immediately get

$$\mathbf{Reg}_T(x_1, \ldots, x_T) \le \sum_{t=1}^{T} 1/(\lambda t) \le (1 + \log(T))/\lambda.$$



We remark that this proof is different in spirit from the usual proofs of fast rates for strongly convex functions, 
and it demonstrates the power of localization. 
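A small numerical check makes the logarithmic rate tangible. The sketch below is only an illustration under stated assumptions: it uses one-dimensional quadratic losses $x_t(f) = (\lambda/2)(f - z_t)^2$ on $[-1, 1]$ with $\lambda = 1/2$ (so that each loss is 1-Lipschitz there), and it plays Follow the Leader, which always sits at the center of the localized ball, as a stand-in for the Localized Meta-Algorithm with unit blocks. The function names are hypothetical.

```python
import math


def ftl_regret(zs, lam=0.5):
    """Follow the Leader on x_t(f) = (lam / 2) * (f - z_t)**2 over F = [-1, 1]."""
    def loss(f, z):
        return 0.5 * lam * (f - z) ** 2

    f, running_sum, total = 0.0, 0.0, 0.0
    for t, z in enumerate(zs, start=1):
        total += loss(f, z)
        running_sum += z
        f = running_sum / t  # ERM after t rounds; stays in [-1, 1] when every z does
    best = running_sum / len(zs)
    return total - sum(loss(best, z) for z in zs)


def localized_bound(T, lam=0.5):
    """Sum over t of 1 / (lam * t), the bound obtained from unit-block localization."""
    return sum(1.0 / (lam * t) for t in range(1, T + 1))
```

In this setting the leader moves by at most $2/t = 1/(\lambda t)$ on round $t$, which is exactly the radius of the localized set, so the regret of the stand-in strategy stays below the sum returned by `localized_bound`.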

5 Adaptive Procedures 

There is a strong interest in developing methods that enjoy worst-case regret guarantees but also take 
advantage of the suboptimality of the sequence being played by Nature. An algorithm that is able to do so 
without knowing in advance that the sequence will have a certain property will be called adaptive. Imagine, 
for instance, running an experts algorithm, and one of the experts has gained such a lead that she is clearly 
the winner (that is, the empirical risk minimizer) at the end of the game. In this case, since we are to be 
compared with the leader at the end, we need not focus on anyone else, and regret for the remainder of the 
game is zero. 

There has been previous work on exploiting particular ways in which sequences can be suboptimal. Examples 
include the Adaptive Gradient Descent of [ ] and Adaptive Hedge of [20]. We now give a generic method
which incorporates the idea of localization in order to adaptively (and constantly) check whether the sequence
being played is of optimal or suboptimal nature. Notice that, as before, we present the algorithm at the
abstract level of the online game with some decision sets $\mathcal{F}$, $\mathcal{X}$, and some loss function $\ell$.

The adaptive procedure below uses a subroutine Block$(\{x_1, \ldots, x_t\}, \tau)$ which, given the history $\{x_1, \ldots, x_t\}$,
returns a subdivision of the next $\tau$ rounds into sub-blocks. The choice of the blocking strategy has to be made
for the particular problem at hand, but, as we show in examples, one can often use very simple strategies. 

Let us describe the adaptive procedure. First, for simplicity of exposition, we start with the doubling-size 
blocks. Here is what happens within each of these blocks. During each round the learner decides whether to 
stay in the same sub-block or to start a new one, as given by the blocking procedure Block. If started, the 
new sub-block uses the localized subset given history of adversary's moves up until last round. Choosing to 
start a new sub-block corresponds to the realization of the learner that the sequence being presented so far 
is in fact suboptimal. The learner then incorporates this suboptimality into the localized procedure. 

Lemma 7. Given some admissible relaxation $\mathbf{Rel}$, the regret of the adaptive localized meta-algorithm
(Algorithm 3) is bounded as

Algorithm 3 Adaptive Localized Meta-Algorithm

Parameters: Relaxation $\mathbf{Rel}$ and block size calculator Block.
Initialize $t = 1$ and nbl $= 1$, and suppose $T = 2^{c+1} - 1$ for some $c \ge 2$.
for $i = 1$ to $c$ do
    $G = \mathbf{Rel}_{2^i}\left( \mathcal{F}_{r(2^i; x_1, \ldots, x_{t-1})} \right)$    % guaranteed value of relaxation
    $m = 1$, curr $= 1$ and $K_1 = 2^i$
    while curr $\le 2^i$ and $t \le T$ do
        $(K'_1, \ldots, K'_{m'}) =$ Block$(\{x_1, \ldots, x_t\}, 2^i - \text{curr})$    % blocking for remainder of $2^i$
        if $G > \sup \sum_{j=1}^{m'} \mathbf{Rel}_{K'_j}\left( \mathcal{F}_{r(K'_j; x_1, \ldots)} \right)$ then
            $k^*_{\text{nbl}} = K'_1$, $K = (K'_2, \ldots, K'_{m'})$, $m = m' - 1$    % better value, accept new blocking
        else
            $k^*_{\text{nbl}} = K_1$, $K = (K_2, \ldots, K_m)$, $m = m - 1$    % else continue with current blocking
        end if
        Play $k^*_{\text{nbl}}$ rounds using MetAlgo$\left( \mathcal{F}_{r(k^*_{\text{nbl}}; x_1, \ldots, x_t)} \right)$
        nbl $=$ nbl $+ 1$, $t = t + k^*_{\text{nbl}-1}$, curr $=$ curr $+ k^*_{\text{nbl}-1}$
        Let $G = \sup \sum_{j=1}^{m} \mathbf{Rel}_{K_j}\left( \mathcal{F}_{r(K_j; x_1, \ldots)} \right)$
    end while
end for

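$$\mathbf{Reg}_T \le \sum_{i=1}^{\text{nbl}} \mathbf{Rel}_{k^*_i}\left( \mathcal{F}_{r(k^*_i; x_1, \ldots, x_{\tilde{k}^*_{i-1}})} \right),$$

where nbl is the number of blocks actually played and the $k^*_i$'s are adaptive block lengths defined within the
algorithm. Further, irrespective of the blocking strategy Block used, if the relaxation $\mathbf{Rel}$ is such that for
any $t$, $\mathbf{Rel}_t(\mathcal{F}) \le t^p$ for some $p \in (0, 1]$, then the worst case regret is always bounded as

$$\mathbf{Reg}_T \le (T^p - 2^{-p}) / (1 - 2^{-p}).$$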

We now demonstrate that the adaptive algorithm in fact takes advantage of sub-optimality in several situations
that have been previously studied in the literature. On the conceptual level, adaptive localization
allows us to view several fast rate results under the same umbrella. 



Example: Adaptive Gradient Descent Consider the online convex optimization scenario. Following
the setup of [ ], suppose the learner encounters a sequence of convex functions $x_t$ with the strong convexity
parameter $\sigma_t$, potentially zero, with respect to a $(2, C)$-smooth norm $\| \cdot \|$. The goal is to adapt to the actual
sequence of functions presented by the adversary. Let us invoke the Adaptive Localized Meta-Algorithm
with a rather simple blocking strategy

Block(K,...,.,},fc) = { ,^ ^, iff 

[ (1,1,...,1) otherwise 

This blocking strategy either says "use all of the next $k$ rounds as one block", or "make each of the next $k$
time steps into a separate block". Let $\hat{f}_t$ be the empirical minimizer at the start of the block (that is, after $t$
rounds), and let $y_t = \nabla x_t(f_t)$. Then we can use the localization

$$\mathcal{F}_{r(k; x_1, \ldots, x_t)} = \{ f \in \mathcal{F} : \| f - \hat{f}_t \| \le 2 \min\{1, k/\sigma_{1:t}\} \}$$

and relaxation

$$\mathbf{Rel}_k\left( \mathcal{F}_{r(k; x_1, \ldots, x_t)} \mid y_1, \ldots, y_i \right) = - \langle \hat{f}_t, \tilde{y}_i \rangle + 2 \min\{1, k/\sigma_{1:t}\} \left( \| \tilde{y}_{i-1} \|^2 + \langle \nabla \| \tilde{y}_{i-1} \|^2, y_i \rangle + C(k - i + 1) \right)^{1/2}$$

where $\tilde{y}_{i-1} = \sum_{j=1}^{i-1} y_j$. For the above relaxation we can show that the corresponding update at round $t + i$ is
given by

-V||^»-i||' 
v/||i,_i||%C(fc-* + l) 

where $k$ is the length of the current block. The next lemma shows that the proposed adaptive gradient
descent recovers the results of [ ]. The method is a mixture of a Follow-the-Leader-style algorithm and a
Gradient-Descent-style algorithm.

Lemma 8. The relaxation specified above is admissible. Suppose the adversary plays 1-Lipschitz convex
functions $x_1, \ldots, x_T$ such that for any $t \in [T]$, $\sum_{i=1}^{t} x_i$ is $\sigma_{1:t}$-strongly convex, and further suppose that for
some $B \le 1$, we have that $\sigma_{1:t} = B t^{\alpha}$. Then, for the blocking strategy specified above,

1. If $\alpha \le 1/2$ then $\mathbf{Reg}_T \le O(\sqrt{T})$

2. // 1 > a > 1/2 then Regj. < O(^) 

3. Ifa=l then Regj, < O (^) 



ft+i = ft -max-j 1, 



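Example: Adaptive Experts We now turn to the setting of Adaptive Hedge or Exponential Weights
algorithm similar to the one studied in [20]. Consider the following situation: for all time steps after some $\tau$,
there is an element (or, expert) $f$ that is the best by a margin $k$ over the next-best choice in $\mathcal{F}$ in terms of the
(unnormalized) cumulative loss, and it remains the winner until the end. Let us use the localization

$$\mathcal{F}_{r(k; x_1, \ldots, x_t)} = \left\{ f \in \mathcal{F} : \sum_{i=1}^{t} \ell(f, x_i) - \min_{f' \in \mathcal{F}} \sum_{i=1}^{t} \ell(f', x_i) \le k \right\},$$

the set of functions closer than the margin to the ERM. Let

$$\hat{\mathcal{F}}_t = \left\{ f \in \mathcal{F} : \sum_{i=1}^{t} \ell(f, x_i) = \min_{f' \in \mathcal{F}} \sum_{i=1}^{t} \ell(f', x_i) \right\}$$

be the set of empirical minimizers at time $t$. We use the blocking strategy

$$\text{Block}(\{x_1, \ldots, x_t\}, k) = (j, k - j) \quad \text{where} \quad j = \min_{f \notin \hat{\mathcal{F}}_t} \sum_{i=1}^{t} \ell(f, x_i) - \min_{f \in \hat{\mathcal{F}}_t} \sum_{i=1}^{t} \ell(f, x_i) \tag{11}$$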



which says that the size of the next block is given by the gap between empirical minimizer(s) and non-minimizers.
The idea behind the proof and the blocking strategy is simple. If it happens at the start of a new
block that there is a large gap between the current leader and the next expert, then for the number of rounds
approximately equal to this gap we can play a new block and not suffer any extra regret.
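The gap-driven blocking rule can be sketched in a few lines. The version below is a hedged illustration rather than the exact rule (11): flooring the gap to an integer, clipping it to the range from 1 to $k$, and returning a single block when all experts are tied are assumptions added so that the function always returns valid block lengths. The name `gap_block` is hypothetical.

```python
import math


def gap_block(cum_losses, k):
    """Split the next k rounds as (j, k - j), with j driven by the leader's margin."""
    best = min(cum_losses)
    others = [L for L in cum_losses if L > best]
    if not others:  # every expert is tied for the lead (assumed convention)
        return (k,)
    gap = min(others) - best
    # Flooring and clipping are assumptions made to obtain integer block lengths.
    j = max(1, min(k, math.floor(gap)))
    return (j, k - j) if j < k else (k,)
```

With cumulative losses 3, 7.5 and 10 and eight rounds remaining, the leader's margin is 4.5, so the rule proposes a first sub-block of four rounds during which the runner-up cannot overtake the leader when per-round losses lie in $[0, 1]$.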

Consider the relaxation (8) used for the Exponential Weights algorithm.

Lemma 9. Suppose that there exists a single best expert

$$f_T = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, x_t),$$

and that for some $k > 1$ there exists $\tau \in [T]$ such that for all $t > \tau$ and all $f \ne f_T$ the partial cumulative loss

$$\sum_{i=1}^{t} \ell(f, x_i) - \sum_{i=1}^{t} \ell(f_T, x_i) > k .$$

Then the regret of Algorithm 3 with the Exponential Weights relaxation, the blocking strategy (11) and the
localization mentioned above is bounded as

$$\mathbf{Reg}_T \le 4 \min\Big\{ \tau, \sqrt{\tau \log(|\mathcal{F}|)} \Big\}$$



While we demonstrated a very simple example, the algorithm is adaptive more generally. Lemma 9 considers
the assumption that a single expert becomes a clear winner after $\tau$ rounds, with a margin of $k$. Even when
there is no clear winner throughout the game, we can still achieve low regret. For instance, this happens
if only a few elements of $\mathcal{F}$ have low cumulative loss throughout the game and the rest of $\mathcal{F}$ suffers heavy
loss. Then the algorithm adapts to the suboptimality and gives a regret bound with the dominating term
depending logarithmically only on the cardinality of the "good" choices in the set $\mathcal{F}$. Similar ideas appear
in [ ], and will be investigated in more generality in the full version of the paper.

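Example: Adapting to the Data Norm Recall that the set $\mathcal{F}_r(k; x_1, \ldots, x_t)$ is the subset of functions
in $\mathcal{F}$ that are possible empirical risk minimizers when we consider $x_1, \ldots, x_{t+k}$ for some $x_{t+1}, \ldots, x_{t+k}$ that
can occur in the future. Now, given history $x_1, \ldots, x_t$ and a possible future sequence $x_{t+1}, \ldots, x_{t+k}$, if $\hat f_{t+k}$ is
an ERM for $x_1, \ldots, x_{t+k}$ and $\hat f_t$ is an ERM for $x_1, \ldots, x_t$ then

$$\sum_{i=1}^{t} \ell(\hat f_{t+k}, x_i) - \sum_{i=1}^{t} \ell(\hat f_t, x_i) = \sum_{i=1}^{t+k} \ell(\hat f_{t+k}, x_i) - \sum_{i=1}^{t+k} \ell(\hat f_t, x_i) + \sum_{i=t+1}^{t+k} \ell(\hat f_t, x_i) - \sum_{i=t+1}^{t+k} \ell(\hat f_{t+k}, x_i)$$

$$\le \sum_{i=t+1}^{t+k} \ell(\hat f_t, x_i) - \sum_{i=t+1}^{t+k} \ell(\hat f_{t+k}, x_i),$$

where the inequality holds because $\hat f_{t+k}$ minimizes the cumulative loss on $x_1, \ldots, x_{t+k}$.
Hence, we see that it suffices to consider localizations

$$\mathcal{F}_r(k; x_1, \ldots, x_t) = \Big\{ f \in \mathcal{F} : \sum_{i=1}^{t} \ell(f, x_i) - \sum_{i=1}^{t} \ell(\hat f_t, x_i) \le \sup_{x_{t+1}, \ldots, x_{t+k}} \Big\{ \sum_{i=t+1}^{t+k} \ell(\hat f_t, x_i) - \sum_{i=t+1}^{t+k} \ell(f, x_i) \Big\} \Big\}$$

If we consider online convex Lipschitz learning problems where $\mathcal{F} = \{f : \|f\| \le 1\}$ and the loss is convex in $f$ and
is such that $\|\nabla \ell(f, x)\|_* \le 1$ in the dual norm $\|\cdot\|_*$, using the above argument we can use the localization

$$\mathcal{F}_r(k; x_1, \ldots, x_t) = \Big\{ f \in \mathcal{F} : \sum_{i=1}^{t} \ell(f, x_i) - \sum_{i=1}^{t} \ell(\hat f_t, x_i) \le k \|f - \hat f_t\| \Big\} \qquad (12)$$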

Further, using Taylor approximation we can pass to the localization

$$\mathcal{F}_r(k; x_1, \ldots, x_t) = \Big\{ f \in \mathcal{F} : \tfrac{1}{2} \|f - \hat f_t\|_{H_t}^2 \le k \|f - \hat f_t\| \Big\} \qquad (13)$$

where $\|f\|_{H_t}^2 = f^\top H_t f$, and $H_t$ is the Hessian of the function $g(f) = \sum_{i=1}^{t} \ell(f, x_i)$. Notice that the earlier
example where we adapt to strong convexity of the loss is a special case of the above localization where we
lower bound the data-dependent norm (Hessian-based norm) by the $\ell_2$ norm times the smallest eigenvalue.
If for instance we are faced with $\eta$-exp-concave losses, such as the squared loss, the data-dependent norm
can be again lower bounded by

$$\|f\|_{H_t}^2 \ge \eta\, f^\top \Big( \sum_{i} \nabla_i \Big) \Big( \sum_{i} \nabla_i \Big)^{\top} f$$

and so we can use localization based on outer products of sum of gradients. We then do not "pay" for those
directions in which the adversary has not played, thus adapting to the effective dimension of the sequence
of plays.
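
To make the strongly convex special case concrete, here is a short worked step. It is only a sketch under stated assumptions: the prefactor in the Hessian-based localization is read as $1/2$, the norm on its right-hand side is taken to be the $\ell_2$ norm, and $\sigma > 0$ denotes a lower bound on the smallest eigenvalue of $H_t$ (the symbol $\sigma$ is introduced here for illustration).

```latex
\|f - \hat f_t\|_{H_t}^2
  = (f - \hat f_t)^\top H_t\, (f - \hat f_t)
  \;\ge\; \sigma\, \|f - \hat f_t\|_2^2 ,
\qquad\text{hence}\qquad
\tfrac{1}{2}\|f - \hat f_t\|_{H_t}^2 \le k\, \|f - \hat f_t\|_2
\;\Longrightarrow\;
\|f - \hat f_t\|_2 \le \frac{2k}{\sigma} .
```

Under these assumptions the localized set is contained in an $\ell_2$ ball around $\hat f_t$ whose radius shrinks as the curvature $\sigma$ grows, which is the sense in which the localization adapts to strong convexity.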

In general, for online convex optimization problems one can use localizations given in Equations (12) or
(13). The localization in Equation (12) is applicable even in the linear setting, and if it so happens that
the adversary mainly plays in a one-dimensional subspace, then the algorithm automatically adapts to the
adversary and yields faster rates for regret. As already mentioned, the example of adaptive gradient descent
is a special case of localization in Equation (13). Of course, one also needs to provide an appropriate blocking
strategy. A possible general blocking strategy could be:

$$\mathrm{Block}(\{x_1, \ldots, x_t\}, k) = (j, k - j), \quad \text{where} \quad j = \operatorname*{argmin}_{j \in \{0, \ldots, k\}} \Big\{ \mathbf{Rel}_j\big(\mathcal{F}_r(x_1, \ldots, x_t)\big) + \sup_{x_{t+1}, \ldots, x_{t+j}} \mathbf{Rel}_{k-j}\big(\mathcal{F}_r(x_1, \ldots, x_{t+k})\big) \Big\}$$



In the remainder of the paper, we develop new algorithms to show the versatility of our approach. One could 
try to argue that the introduction of the notion of a relaxation has not alleviated the burden of algorithm 
development, as we simply pushed the work into magically coming up with a relaxation. We would like 
to stress that this is not so. A key observation is that a relaxation does not appear out of thin air, but 
rather as an upper bound on the sequential Rademacher complexity. Thus, a general recipe is to start with 
a problem at hand and develop a sequence of upper bounds until one obtains a computationally feasible one, 
or until other desired properties are satisfied. Exactly for this purpose, the proofs in the appendix derive 
the relaxations rather than just present them as something given. Since one would follow the same upper 
bounding steps to prove an upper bound on the value of the game, the derivation of the relaxation and the 
proof of the regret bound go hand-in-hand. For this reason, we sometimes omit the explicit mention of a 
regret bound for the sake of conciseness: the algorithms enjoy the same regret bound as that obtained by 
the corresponding non-constructive proof of the upper bound. 



6 Classification 



We start by considering the problem of supervised learning, where $\mathcal{X}$ is the space of instances and $\mathcal{Y}$ the
space of responses (labels). There are two closely related protocols for the online interaction between the
learner and Nature, so let us outline them. The "proper" version of supervised learning follows the protocol
presented in Section 2: at time $t$, the learner selects $f_t \in \mathcal{F}$, Nature simultaneously selects $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$, and
the learner suffers the loss $\ell(f_t(x_t), y_t)$. The "improper" version is as follows: at time $t$, Nature chooses $x_t \in \mathcal{X}$
and presents it to the learner as "side information", the learner then picks $\hat y_t \in \mathcal{Y}$ and Nature simultaneously
chooses $y_t \in \mathcal{Y}$. In the improper version, the loss of the learner is $\ell(\hat y_t, y_t)$, and it is easy to see that we
may equivalently state this protocol as the learner choosing any function $f_t \in \mathcal{Y}^{\mathcal{X}}$ (not necessarily in $\mathcal{F}$), and
Nature simultaneously choosing $(x_t, y_t)$. We mostly focus on the "improper" version of supervised learning,
as the distinction does not make any difference in any of the bounds.

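For the improper version of supervised learning, we may write the value in (1) as

$$\mathcal{V}_T(\mathcal{F}) = \sup_{x_1 \in \mathcal{X}} \inf_{q_1 \in \Delta(\mathcal{Y})} \sup_{y_1 \in \mathcal{Y}} \mathbb{E}_{\hat y_1 \sim q_1} \cdots \sup_{x_T \in \mathcal{X}} \inf_{q_T \in \Delta(\mathcal{Y})} \sup_{y_T \in \mathcal{Y}} \mathbb{E}_{\hat y_T \sim q_T} \Big[ \sum_{t=1}^{T} \ell(\hat y_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x_t), y_t) \Big]$$

and a relaxation $\mathbf{Rel}_T$ is admissible if for any $(x_1, y_1), \ldots, (x_T, y_T) \in \mathcal{X} \times \mathcal{Y}$,

$$\sup_{x \in \mathcal{X}} \inf_{q \in \Delta(\mathcal{Y})} \sup_{y \in \mathcal{Y}} \Big\{ \mathbb{E}_{\hat y \sim q}\, \ell(\hat y, y) + \mathbf{Rel}_T\big(\mathcal{F} \,\big|\, \{(x_i, y_i)\}_{i=1}^{t}, (x, y)\big) \Big\} \le \mathbf{Rel}_T\big(\mathcal{F} \,\big|\, \{(x_i, y_i)\}_{i=1}^{t}\big) \qquad (14)$$

and

$$\mathbf{Rel}_T\big(\mathcal{F} \,\big|\, \{(x_t, y_t)\}_{t=1}^{T}\big) \ge - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x_t), y_t).$$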

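Let us now focus on binary label prediction, that is $\mathcal{Y} = \{\pm 1\}$. In this case, the supremum over $y$ in (14)
becomes a maximum over two values. Let us now take the absolute loss $\ell(\hat y, y) = |\hat y - y| = 1 - \hat y y$. We can see
that the optimal randomized strategy, given the side information $x$, is given by (14) as

$$\operatorname*{argmin}_{q \in [-1, 1]} \max\Big\{ 1 - q + \mathbf{Rel}_T\big(\mathcal{F} \,\big|\, \{(x_i, y_i)\}_{i=1}^{t}, (x, 1)\big),\; 1 + q + \mathbf{Rel}_T\big(\mathcal{F} \,\big|\, \{(x_i, y_i)\}_{i=1}^{t}, (x, -1)\big) \Big\}$$

where $q$ denotes the mean of the randomized prediction. The minimum is achieved by setting the two expressions equal to each other:

$$q = \frac{1}{2} \Big\{ \mathbf{Rel}_T\big(\mathcal{F} \,\big|\, \{(x_i, y_i)\}_{i=1}^{t}, (x, 1)\big) - \mathbf{Rel}_T\big(\mathcal{F} \,\big|\, \{(x_i, y_i)\}_{i=1}^{t}, (x, -1)\big) \Big\} \qquad (15)$$

This result will be specialized in the later sections for particular relaxations $\mathbf{Rel}_T$ and extended beyond
absolute loss. We remark that the extension to $k$-class prediction is immediate and involves taking a maximum
over $k$ terms in (14).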
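
The equalizing step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the two relaxation values are assumed to be supplied by the caller, the names `mean_from_relaxation` and `sample_label` are hypothetical, and the clipping to $[-1, 1]$ is a safeguard added here as an assumption.

```python
import random

def mean_from_relaxation(rel_plus, rel_minus):
    """Mean q of the randomized prediction obtained by equating
    1 - q + rel_plus with 1 + q + rel_minus, i.e. q = (rel_plus - rel_minus) / 2.
    rel_plus / rel_minus are the relaxation values with the candidate
    label +1 / -1 appended to the history (supplied by the caller)."""
    q = 0.5 * (rel_plus - rel_minus)
    # Assumption: clip so that q is a valid mean of a {-1, +1} variable.
    return max(-1.0, min(1.0, q))

def sample_label(q, rng=random):
    """Draw a label in {-1, +1} whose expectation is q."""
    return 1 if rng.random() < (1.0 + q) / 2.0 else -1
```

With `rel_plus = 3.0` and `rel_minus = 2.0` the sketch gives `q = 0.5`, and both expressions inside the maximum evaluate to 3.5.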

6.1 Algorithms Based on the Littlestone's Dimension 

Consider the problem of binary prediction, as described above. Further, assume that $\mathcal{F}$ has a finite
Littlestone's dimension $\mathrm{Ldim}(\mathcal{F})$ [12, 6]. Suppose the loss function is $\ell(\hat y, y) = |\hat y - y|$, and consider the "mixed"
conditional Rademacher complexity

$$\sup_{\mathbf{x}} \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \Big\{ 2 \sum_{i=1}^{T-t} \epsilon_i f(\mathbf{x}_i(\epsilon)) - \sum_{i=1}^{t} |f(x_i) - y_i| \Big\} \qquad (16)$$

as a possible relaxation. Observe that the above complexity is defined with the loss function removed (in a
contraction-style argument [14]) in the terms involving the "future", in contrast with the definition (6). The
latter is defined with loss functions on both the "future" and the "past" terms. In general, if we can pass from
the sequential Rademacher complexity over the loss class $\ell(\mathcal{F})$ to the sequential Rademacher complexity of
the base class $\mathcal{F}$, we may attempt to do so step-by-step by using the "mixed" type of sequential Rademacher
complexity as in (16). This idea shall be used several times later in this paper.

The admissibility condition (14) with the conditional sequential Rademacher (16) as a relaxation would
require us to upper bound

$$\sup_{x_t} \inf_{q_t \in [-1, 1]} \max_{y_t \in \{\pm 1\}} \Big\{ \mathbb{E}_{\hat y_t \sim q_t} |\hat y_t - y_t| + \sup_{\mathbf{x}} \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \Big\{ 2 \sum_{i=1}^{T-t} \epsilon_i f(\mathbf{x}_i(\epsilon)) - \sum_{i=1}^{t} |f(x_i) - y_i| \Big\} \Big\} \qquad (17)$$

We observe that the supremum over $\mathbf{x}$ is preventing us from obtaining a concise algorithm. We need to
further "relax" this supremum, and the idea is to pass to a finite cover of $\mathcal{F}$ on the given tree $\mathbf{x}$ and then
proceed as in the Exponential Weights example for a finite collection of experts. This leads to an upper
bound on (16) and gives rise to algorithms similar in spirit to those developed in [ ], but with more attractive
computational properties and defined more concisely.

Define the function $g(d, t) = \sum_{i=0}^{d} \binom{t}{i}$, which is shown in [ ] to be the maximum size of an exact (zero)
cover for a function class with the Littlestone's dimension $\mathrm{Ldim} = d$. Given $\{(x_1, y_1), \ldots, (x_t, y_t)\}$ and
$\sigma = (\sigma_1, \ldots, \sigma_t) \in \{\pm 1\}^t$, let

$$\mathcal{F}_t(\sigma) = \{ f \in \mathcal{F} : f(x_i) = \sigma_i \;\; \forall i \le t \},$$

the subset of functions that agree with the signs given by $\sigma$ on the "past" data and let

$$\mathcal{F}|_{x_1, \ldots, x_t} = \mathcal{F}|_{x^t} = \{ (f(x_1), \ldots, f(x_t)) : f \in \mathcal{F} \}$$

be the projection of $\mathcal{F}$ onto $x_1, \ldots, x_t$. Denote $L_t(f) = \sum_{i=1}^{t} |f(x_i) - y_i|$ and $L_t(\sigma) = \sum_{i=1}^{t} |\sigma_i - y_i|$ for
$\sigma \in \{\pm 1\}^t$. The following proposition gives a relaxation and two algorithms, both of which achieve the
$O(\sqrt{\mathrm{Ldim}(\mathcal{F})\, T \log T})$ regret bound proved in [g], yet both different from the algorithm in that paper.
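
For intuition about the size of the cover bound, the function $g(d, t) = \sum_{i=0}^{d} \binom{t}{i}$ can be evaluated directly. The Python sketch below is only an illustration; the function name `g` mirrors the notation above.

```python
from math import comb

def g(d, t):
    """Evaluate g(d, t) = sum_{i=0}^{d} C(t, i).
    math.comb(t, i) returns 0 when i > t, so d may exceed t."""
    return sum(comb(t, i) for i in range(d + 1))
```

When $d \ge t$ the sum equals $2^t$ (all sign patterns are counted), while for small $d$ it grows only polynomially in $t$; for example `g(2, 4)` is `1 + 4 + 6 = 11`, compared with `2**4 = 16`.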

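Proposition 10. The relaxation

$$\mathbf{Rel}_T\big(\mathcal{F} \,\big|\, (x^t, y^t)\big) = \frac{1}{\lambda} \log\Big( \sum_{\sigma \in \mathcal{F}|_{x^t}} g\big(\mathrm{Ldim}(\mathcal{F}_t(\sigma)), T - t\big) \exp\{-\lambda L_t(\sigma)\} \Big) + 2\lambda (T - t)$$

is admissible and leads to an admissible algorithm

$$q_t(+1) = \frac{\sum_{(\sigma, +1) \in \mathcal{F}|_{x^t}} g\big(\mathrm{Ldim}(\mathcal{F}_t(\sigma, +1)), T - t\big) \exp\{-\lambda L_{t-1}(\sigma)\}}{\sum_{(\sigma, \sigma_t) \in \mathcal{F}|_{x^t}} g\big(\mathrm{Ldim}(\mathcal{F}_t(\sigma, \sigma_t)), T - t\big) \exp\{-\lambda L_{t-1}(\sigma)\}} \qquad (18)$$

with $q_t(-1) = 1 - q_t(+1)$. An alternative method for the same relaxation and the same regret guarantee is to
predict the label $\hat y_t$ according to a distribution with mean

$$q_t = \frac{1}{2\lambda} \log \frac{\sum_{(\sigma, \sigma_t) \in \mathcal{F}|_{x^t}} g\big(\mathrm{Ldim}(\mathcal{F}_t(\sigma, \sigma_t)), T - t\big) \exp\{-\lambda L_{t-1}(\sigma)\} \exp\{-\lambda (1 - \sigma_t)\}}{\sum_{(\sigma, \sigma_t) \in \mathcal{F}|_{x^t}} g\big(\mathrm{Ldim}(\mathcal{F}_t(\sigma, \sigma_t)), T - t\big) \exp\{-\lambda L_{t-1}(\sigma)\} \exp\{-\lambda (1 + \sigma_t)\}} \qquad (19)$$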

There is a very close correspondence between the proof of Proposition 10 and the proof of the combinatorial
lemma of [ ], the analogue of the Vapnik-Chervonenkis-Sauer-Shelah result.

The two algorithms presented above show two alternatives: one employs the properties of exponential
weights, and the other uses the solution in (15). The merits of the two approaches remain
to be explored. In particular, it appears that the method based on (15) can lead to some non-trivial new
algorithms, distinct from the more common exponential weighting technique.



7 Randomized Algorithms and Follow the Perturbed Leader 

We now develop a class of admissible randomized methods that arise through sampling. Consider the
objective

$$\inf_{q \in \Delta(\mathcal{F})} \sup_{x} \Big\{ \mathbb{E}_{f \sim q} [\ell(f, x)] + \mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_{t-1}, x) \Big\}$$

given by a relaxation $\mathbf{Rel}_T$. If $\mathbf{Rel}_T$ is the sequential (or classical) Rademacher complexity, it involves an
expectation over sequences of coin flips, and this computation (coupled with optimization for each sequence)
can be prohibitively expensive. More generally, $\mathbf{Rel}_T$ might involve an expectation over possible ways
in which the future might be realized. In such cases, we may consider a rather simple "random playout"
strategy: draw the random sequence and solve only one optimization problem for that random sequence.
The ideas of random playout have been discussed previously in the literature for estimating the utility of
a move in a game (see also [-i]). In this section we show that, in fact, the random playout strategy has a
solid basis: for the examples we consider, it satisfies admissibility. Furthermore, we show that Follow the
Perturbed Leader is an example of such a randomized strategy.

Let us informally describe the general idea, as the key steps might be hard to trace in the proofs. Suppose
our objective is of the form

$$\mathcal{G}(q) = \sup_{x} \Big( \mathbb{E}_{f \sim q} \Psi(f, x) + \mathbb{E}_{w \sim p} \Phi(w, x) \Big)$$

for some functions $\Psi$ and $\Phi$, and $q$ a mixed strategy. We have in mind the situation where the first term is
the instantaneous loss at the present round, and the second term is the expected cost for the future. Consider
a randomized strategy $q$ which is defined by first randomly drawing $w \sim p$ and then computing

$$f(w) = \operatorname*{argmin}_{f} \sup_{x} \big( \Psi(f, x) + \Phi(w, x) \big)$$

for the random draw $w$. We then verify that

$$\mathcal{G}(q) = \sup_{x} \Big( \mathbb{E}_{f \sim q} \Psi(f, x) + \mathbb{E}_{w \sim p} \Phi(w, x) \Big) = \sup_{x} \Big( \mathbb{E}_{w \sim p} \Psi(f(w), x) + \mathbb{E}_{w \sim p} \Phi(w, x) \Big)$$

$$\le \mathbb{E}_{w \sim p} \sup_{x} \big( \Psi(f(w), x) + \Phi(w, x) \big) = \mathbb{E}_{w \sim p} \inf_{f} \sup_{x} \big( \Psi(f, x) + \Phi(w, x) \big).$$

What makes the proof of admissibility possible is that the infimum in the last expression is inside the
expectation over $w$ rather than outside. We can then appeal to the minimax theorem to prove admissibility.
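
The inequality in this argument (a supremum of an average is at most the average of the suprema) can be checked numerically on a toy problem. The Python sketch below is an illustration under assumptions made here: the sets of moves, outcomes and random draws are finite, the draw $w$ is uniform, and the two functions are given as tables `psi[f][x]` and `phi[w][x]`; the name `playout_check` is hypothetical.

```python
import random

def playout_check(psi, phi):
    """Compare the value of the random-playout strategy with its upper bound.

    psi[f][x]: instantaneous cost of move f against outcome x.
    phi[w][x]: future cost for random draw w (uniform) and outcome x.
    Returns (value, bound) with value <= bound.
    """
    n_f, n_x, n_w = len(psi), len(psi[0]), len(phi)

    def best_move(w):
        # f(w) = argmin_f max_x (psi[f][x] + phi[w][x]) for this single draw.
        return min(range(n_f),
                   key=lambda f: max(psi[f][x] + phi[w][x] for x in range(n_x)))

    picks = [best_move(w) for w in range(n_w)]
    # Value of the randomized strategy: worst outcome of the averaged objective.
    value = max(sum(psi[picks[w]][x] + phi[w][x] for w in range(n_w)) / n_w
                for x in range(n_x))
    # Upper bound: average over draws of the per-draw minimax value.
    bound = sum(max(psi[picks[w]][x] + phi[w][x] for x in range(n_x))
                for w in range(n_w)) / n_w
    return value, bound
```

On any pair of tables the returned `value` never exceeds `bound`, which is the step that moves the infimum inside the expectation.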

In our examples, $\Psi$ is the loss at round $t$ and $\Phi$ is the relaxation term, such as the sequential Rademacher
complexity. In Section 7.4 we show that, if we can compute the "worst" tree $\mathbf{x}$, we can randomly draw a path
and use it for our randomized strategy. Note that the worst-case trees are closely related to random walks
of maximal variation, and our method thus points to an intriguing connection between regret minimization
and random walks (see also [ , :] for related ideas).

Interestingly, in many learning problems it turns out that the sequential Rademacher complexity and the
classical Rademacher complexity are within a constant factor of each other. In such cases, the function $\Phi$
does not involve the supremum over a tree, and the randomized method only needs to draw a sequence of
coin flips and compute a solution to an optimization problem slightly more complicated than ERM.

In particular, the sequential and classical Rademacher complexities can be related for linear classes in
finite-dimensional spaces. Online linear optimization is then a natural application of the randomized method we
propose. Indeed, we show that the Follow the Perturbed Leader (FPL) algorithm [10] arises in this way. We
note that FPL has been previously considered as a rather unorthodox algorithm providing some kind of
regularization via randomization. Our analysis shows that it arises through a natural relaxation based on
the sequential (and thus the classical) Rademacher complexity, coupled with the random playout idea. As
a new algorithmic contribution, we provide a version of the FPL algorithm for the case of the decision sets
being $\ell_2$ balls, with a regret bound that is independent of the dimension. We also provide an FPL-style
method for the combination of $\ell_1$ and $\ell_\infty$ balls. To the best of our knowledge, these results are novel.

In the later sections, we provide a novel randomized method for the Trace Norm Completion problem,
and a novel randomized method for the setting of static experts and transductive learning. In general,
the techniques we develop might in the future provide computationally feasible randomized algorithms where
deterministic ones are too computationally demanding.

7.1 When Sequential and Classical Rademacher Complexities are Related 

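The assumption below implies that the sequential Rademacher complexity and the classical Rademacher
complexity are within a constant factor $C$ of each other. We will later verify that this assumption holds in the
examples we consider.

Assumption 1. There exists a distribution $D \in \Delta(\mathcal{X})$ and constant $C \ge 2$ such that for any $t \in [T]$ and
given any $x_1, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T \in \mathcal{X}$ and any $\epsilon_{t+1}, \ldots, \epsilon_T \in \{\pm 1\}$,

$$\sup_{p \in \Delta(\mathcal{X})} \mathbb{E}_{x_t \sim p} \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t+1}^{T} \epsilon_i \ell(f, x_i) - L_{t-1}(f) + \mathbb{E}_{x \sim p}[\ell(f, x)] - \ell(f, x_t) \Big]$$

$$\le \mathbb{E}_{\epsilon_t,\, x_t \sim D} \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t}^{T} \epsilon_i \ell(f, x_i) - L_{t-1}(f) \Big] \qquad (20)$$

where $\epsilon_t$ is an independent Rademacher random variable and $L_{t-1}(f) = \sum_{i=1}^{t-1} \ell(f, x_i)$.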



Under the above assumption one can use the following relaxation

$$\mathbf{Rel}_T(\mathcal{F} \mid x_1, \ldots, x_t) = \mathbb{E}_{x_{t+1}, \ldots, x_T \sim D}\, \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t+1}^{T} \epsilon_i \ell(f, x_i) - \sum_{i=1}^{t} \ell(f, x_i) \Big] \qquad (21)$$

which is a partially symmetrized version of the classical Rademacher averages.

The proof of admissibility for the randomized methods based on this relaxation is quite curious: the
forecaster can be seen as mimicking the sequential Rademacher complexity by sampling from the "equivalently
bad" classical Rademacher complexity under the specific distribution $D$ given by the above assumption.

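Lemma 11. Under Assumption 1, the relaxation in Eq. (21) is admissible and a randomized strategy
that ensures admissibility is given by: at time $t$, draw $x_{t+1}, \ldots, x_T \sim D$ and Rademacher random variables
$\epsilon = (\epsilon_{t+1}, \ldots, \epsilon_T)$ and then:

1. In the case the loss $\ell$ is convex in its first argument and the set $\mathcal{F}$ is convex and compact, define

$$f_t = \operatorname*{argmin}_{g \in \mathcal{F}} \sup_{x \in \mathcal{X}} \Big\{ \ell(g, x) + \sup_{f \in \mathcal{F}} \Big\{ C \sum_{i=t+1}^{T} \epsilon_i \ell(f, x_i) - \sum_{i=1}^{t-1} \ell(f, x_i) - \ell(f, x) \Big\} \Big\} \qquad (22)$$

2. In the case of non-convex loss, sample $f_t$ from the distribution

$$q_t = \operatorname*{argmin}_{q \in \Delta(\mathcal{F})} \sup_{x \in \mathcal{X}} \Big\{ \mathbb{E}_{f \sim q} [\ell(f, x)] + \sup_{f \in \mathcal{F}} \Big\{ C \sum_{i=t+1}^{T} \epsilon_i \ell(f, x_i) - \sum_{i=1}^{t-1} \ell(f, x_i) - \ell(f, x) \Big\} \Big\} \qquad (23)$$

The expected regret for the method is bounded by the classical Rademacher complexity:

$$\mathbb{E}[\mathbf{Reg}_T] \le C\, \mathbb{E}_{x_{1:T} \sim D}\, \mathbb{E}_{\epsilon} \Big[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \ell(f, x_t) \Big]$$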



Of particular interest are the settings of static experts and transductive learning, which we consider in
Section 8. In the transductive case, the $x_t$'s are pre-specified before the game, and in the static expert case
they are effectively absent. In these cases, as we show below, there is no explicit distribution $D$ and we only need
to sample the random signs $\epsilon$. We easily see that in these cases, the expected regret bound is simply two
times the transductive Rademacher complexity.



7.2 Linear Loss 

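The idea of sampling from a fixed distribution is particularly appealing in the case of linear loss, $\ell(f, x) =
\langle f, x \rangle$. Suppose $\mathcal{X}$ is a unit ball in some norm $\|\cdot\|$ in a vector space $B$, and $\mathcal{F}$ is a unit ball in the dual norm
$\|\cdot\|_*$. Assumption 1 then becomes

Assumption 2. There exists a distribution $D \in \Delta(\mathcal{X})$ and constant $C \ge 2$ such that for any $t \in [T]$ and
given any $x_1, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T \in \mathcal{X}$ and any $\epsilon_{t+1}, \ldots, \epsilon_T \in \{\pm 1\}$,

$$\sup_{p \in \Delta(\mathcal{X})} \mathbb{E}_{x_t \sim p} \Big\| C \sum_{i=t+1}^{T} \epsilon_i x_i - \sum_{i=1}^{t-1} x_i + \mathbb{E}_{x \sim p}[x] - x_t \Big\| \le \mathbb{E}_{\epsilon_t,\, x_t \sim D} \Big\| C \sum_{i=t}^{T} \epsilon_i x_i - \sum_{i=1}^{t-1} x_i \Big\| \qquad (24)$$

For (24) to hold it is enough to ensure that

$$\sup_{p \in \Delta(\mathcal{X})} \mathbb{E}_{x_t \sim p} \Big\| w + \mathbb{E}_{x \sim p}[x] - x_t \Big\| \le \mathbb{E}_{\epsilon_t,\, x_t \sim D} \big\| w + C \epsilon_t x_t \big\| \qquad (25)$$

for any $w \in B$.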



At round $t$, the generic algorithm specified by Lemma 23 draws fresh Rademacher random variables $\epsilon$ and
$x_{t+1}, \ldots, x_T \sim D$ and picks

$$f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \sup_{x \in \mathcal{X}} \Big\{ \langle f, x \rangle + \Big\| C \sum_{i=t+1}^{T} \epsilon_i x_i - \sum_{i=1}^{t-1} x_i - x \Big\| \Big\} \qquad (26)$$

We now look at specific examples of the $\ell_2/\ell_2$ and $\ell_1/\ell_\infty$ cases and provide closed-form solutions of the randomized
algorithms.

Example: $\ell_1/\ell_\infty$ Follow the Perturbed Leader

Here, we consider the setting similar to that in [ ]. Let $\mathcal{F} \subset \mathbb{R}^N$ be the $\ell_1$ unit ball and $\mathcal{X}$ the (dual) $\ell_\infty$
unit ball in $\mathbb{R}^N$. In [10], $\mathcal{F}$ is the probability simplex and $\mathcal{X} = [0, 1]^N$ but these are subsumed by the $\ell_1/\ell_\infty$
case. We claim that:

Lemma 12. Assumption 2 is satisfied with a distribution $D$ that is uniform on the vertices of the cube
$\{\pm 1\}^N$ and $C = 6$.

In fact, one can pick any symmetric distribution $D$ on the real line and use $D^N$ for the perturbation.
Assumption 2 is then satisfied, as we show in the following lemma.

Lemma 13. If $D$ is any symmetric distribution over the real line, then Assumption 2 is satisfied by using
the product distribution $D^N$. The constant $C$ required is any $C \ge 6 / \mathbb{E}_{x \sim D}|x|$.

The above lemma is especially attractive when used with the standard normal distribution, because in that case
a sum of normal random variables is again normal. Hence, instead of drawing $x_{t+1}, \ldots, x_T \sim N(0, 1)$ on
round $t$, one can simply draw just one vector $X_t \sim N(0, \sqrt{T - t})$ (the second parameter being the standard
deviation of each coordinate) and use it for perturbation. In this case the
constant $C$ is bounded by 8.

While we have provided simple distributions to use for perturbation, the form of update in Equation (26) is
not in a convenient form. The following lemma shows a simple Follow the Perturbed Leader type algorithm
with the associated regret bound.

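Lemma 14. Suppose $\mathcal{F}$ is the $\ell_1^N$ unit ball and $\mathcal{X}$ is the dual $\ell_\infty^N$ unit ball, and let $D$ be any symmetric
distribution. Consider the randomized algorithm that at each round $t$ freshly draws Rademacher random
variables $\epsilon_{t+1}, \ldots, \epsilon_T$ and freshly draws $x_{t+1}, \ldots, x_T \sim D^N$ (each coordinate drawn independently from $D$)
and picks

$$f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \Big\langle f, \sum_{i=1}^{t-1} x_i - C \sum_{i=t+1}^{T} \epsilon_i x_i \Big\rangle$$

where $C = 6 / \mathbb{E}_{x \sim D}[|x|]$. The randomized algorithm enjoys a bound on the expected regret given by

$$\mathbb{E}[\mathbf{Reg}_T] \le C\, \mathbb{E}_{x_{1:T} \sim D^N} \mathbb{E}_{\epsilon} \Big\| \sum_{t=1}^{T} \epsilon_t x_t \Big\|_\infty + 4 \sum_{t=1}^{T} P_{y_{t+1}, \ldots, y_T \sim D} \Big( C \Big| \sum_{i=t+1}^{T} y_i \Big| \le 4 \Big)$$

Notice that for $D$ being the $\{\pm 1\}$ coin flips or standard normal distribution, the probability

$$P_{y_{t+1}, \ldots, y_T \sim D} \Big( C \Big| \sum_{i=t+1}^{T} y_i \Big| \le 4 \Big)$$

is exponentially small in $T - t$ and so $\sum_{t=1}^{T} P_{y_{t+1}, \ldots, y_T \sim D} \big( C \big| \sum_{i=t+1}^{T} y_i \big| \le 4 \big)$ is bounded by a constant. For these
cases, we have

$$\mathbb{E}[\mathbf{Reg}_T] \le O\Big( \mathbb{E}_{x_{1:T} \sim D^N} \mathbb{E}_{\epsilon} \Big\| \sum_{t=1}^{T} \epsilon_t x_t \Big\|_\infty \Big) = O\big(\sqrt{T \log N}\big)$$

This yields the logarithmic dependence on the dimension, matching that of the Exponential Weights
algorithm.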
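
A single round of such a perturbed-leader rule on the $\ell_1$ ball can be sketched as follows. This is a minimal illustration under assumptions made here: $D$ is taken to be the uniform $\{\pm 1\}$ distribution (so $C = 6$), the perturbed vector is taken to be the past sum minus $C$ times the signed future draws, and the name `fpl_l1_step` and its arguments are hypothetical. A linear function over the $\ell_1$ ball is minimized at a signed coordinate vector, which is what the last lines compute.

```python
import random

def fpl_l1_step(past_sum, rounds_left, C=6.0, rng=random):
    """One perturbed-leader move on the l1 unit ball (illustrative sketch).

    past_sum: coordinate-wise sum of the outcomes observed so far.
    rounds_left: number of future rounds T - t to simulate.
    """
    n = len(past_sum)
    v = list(past_sum)
    for _ in range(rounds_left):
        eps = rng.choice((-1, 1))            # Rademacher sign for this round
        for j in range(n):
            # Each coordinate of the simulated outcome is a fresh coin flip.
            v[j] -= C * eps * rng.choice((-1, 1))
    # argmin over the l1 ball of <f, v>: put all mass on the largest |v_j|,
    # with sign opposite to v_j.
    j_star = max(range(n), key=lambda j: abs(v[j]))
    f = [0.0] * n
    if v[j_star] != 0:
        f[j_star] = -1.0 if v[j_star] > 0 else 1.0
    return f
```

With no future rounds left the rule reduces to Follow the Leader: for `past_sum = [1.0, -3.0, 2.0]` it returns `[0.0, 1.0, 0.0]`.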



Example: $\ell_2/\ell_2$ Follow the Perturbed Leader

We now consider the case when $\mathcal{F}$ and $\mathcal{X}$ are both the unit $\ell_2$ ball. We can use as perturbation the uniform
distribution on the surface of the unit sphere, as the following lemma shows. This result was already hinted
at in [2], as the random draw from the unit sphere is likely to produce an orthogonal direction, yielding a
strategy close to optimal. However, we do not require dimensionality to be high for the result to hold.

Lemma 15. Let $\mathcal{X}$ and $\mathcal{F}$ be unit balls in Euclidean norm. Then Assumption 2 is satisfied with a uniform
distribution $D$ on the surface of the unit sphere with constant $C = 4\sqrt{2}$.

Again, as in the previous example, the update in Equation (26) is not in a convenient form, and this
is addressed in the following lemma.

Lemma 16. Let $\mathcal{X}$ and $\mathcal{F}$ be unit balls in Euclidean norm, and $D$ be the uniform distribution on the
surface of the unit sphere. Consider the randomized algorithm that at each round (say round $t$) freshly draws
$x_{t+1}, \ldots, x_T \sim D$ and picks

$$f_t = - \frac{\sum_{i=1}^{t-1} x_i + C \sum_{i=t+1}^{T} x_i}{\sqrt{\big\| \sum_{i=1}^{t-1} x_i + C \sum_{i=t+1}^{T} x_i \big\|_2^2 + 1}}$$

where $C = 4\sqrt{2}$. The randomized algorithm enjoys a bound on the expected regret given by

$$\mathbb{E}[\mathbf{Reg}_T] \le C\, \mathbb{E}_{x_1, \ldots, x_T \sim D} \Big\| \sum_{t=1}^{T} x_t \Big\|_2 \le 4\sqrt{2T}$$

Importantly, the bound does not depend on the dimensionality of the space. To the best of our knowledge,
this is the first such result for Follow the Perturbed Leader style algorithms.
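
A round of the Euclidean variant can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements of the lemma: the move is taken to be the negated perturbed sum divided by $\sqrt{\|\cdot\|_2^2 + 1}$, uniform points on the sphere are obtained by normalizing Gaussian vectors, and the names `sample_sphere` and `fpl_l2_step` are hypothetical.

```python
import math
import random

def sample_sphere(n, rng=random):
    """Uniform draw from the unit sphere in R^n via a normalized Gaussian."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0:
            return [c / norm for c in g]

def fpl_l2_step(past_sum, rounds_left, C=4 * math.sqrt(2), rng=random):
    """One perturbed-leader move on the l2 unit ball (illustrative sketch).

    past_sum: coordinate-wise sum of the outcomes observed so far.
    rounds_left: number of future rounds T - t to simulate.
    """
    n = len(past_sum)
    v = list(past_sum)
    for _ in range(rounds_left):
        x = sample_sphere(n, rng)
        for j in range(n):
            v[j] += C * x[j]                 # add the scaled simulated outcome
    # Assumed normalization: keeps the move strictly inside the unit ball.
    scale = math.sqrt(sum(c * c for c in v) + 1.0)
    return [-c / scale for c in v]
```

Only one pass over the simulated future is needed per round, and no quantity in the sketch depends on the dimension other than the length of the vectors.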

Remark 1. The FPL methods developed in [10, 7] assume that the adversary is oblivious. With this
simplification, the algorithms can reuse the same random perturbation drawn at the beginning of the game. It
is then argued in [ ] that the methods also work for non-oblivious opponents since the FPL strategy is fully
determined by the outcomes played by the adversary [., Remark 4-^]. In contrast, our proofs directly deal
with the adaptive adversary.



7.3 Supervised Learning 

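For completeness, let us state a version of Assumption 1 for the case of supervised learning. That is, the
side information $x_t$ is presented to the learner, who then picks $\hat y_t$ and observes the outcome $y_t$.

Assumption 3. There exists a distribution $D \in \Delta(\mathcal{X} \times \mathcal{Y})$ and constant $C \ge 2$ such that for any $t \in [T]$ and
given any $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), (x_{t+1}, y_{t+1}), \ldots, (x_T, y_T) \in \mathcal{X} \times \mathcal{Y}$ and any $\epsilon_{t+1}, \ldots, \epsilon_T \in \{\pm 1\}$,

$$\sup_{x_t \in \mathcal{X}} \sup_{p_t \in \Delta(\mathcal{Y})} \mathbb{E}_{y_t \sim p_t} \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t+1}^{T} \epsilon_i \ell(f(x_i), y_i) - L_{t-1}(f) + \mathbb{E}_{y \sim p_t}[\ell(f(x_t), y)] - \ell(f(x_t), y_t) \Big]$$

$$\le \mathbb{E}_{\epsilon_t,\, (x_t, y_t) \sim D} \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t}^{T} \epsilon_i \ell(f(x_i), y_i) - L_{t-1}(f) \Big],$$

where $\epsilon_t$ is an independent Rademacher random variable and $L_{t-1}(f) = \sum_{i=1}^{t-1} \ell(f(x_i), y_i)$.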
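Under Assumption 3, we can use the following relaxation:

$$\mathbf{Rel}_T\big(\mathcal{F} \,\big|\, (x_1, y_1), \ldots, (x_t, y_t)\big) = \mathbb{E}_{(x_{t+1}, y_{t+1}), \ldots, (x_T, y_T) \sim D}\; \mathbb{E}_{\epsilon_{t+1:T}} \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t+1}^{T} \epsilon_i \ell(f(x_i), y_i) - \sum_{i=1}^{t} \ell(f(x_i), y_i) \Big] \qquad (27)$$

Lemma 17. Under Assumption 3, the relaxation in Eq. (27) is admissible and a randomized strategy
that ensures admissibility is given by: at time $t$, draw $(x_{t+1}, y_{t+1}), \ldots, (x_T, y_T) \sim D$ and Rademacher random
variables $\epsilon_{t+1}, \ldots, \epsilon_T$ and then:

1. In the case the loss $\ell$ is convex in its first argument, define

$$\hat y_t = \operatorname*{argmin}_{\hat y \in [-B, B]} \sup_{y_t \in \mathcal{Y}} \Big\{ \ell(\hat y, y_t) + \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t+1}^{T} \epsilon_i \ell(f(x_i), y_i) - \sum_{i=1}^{t} \ell(f(x_i), y_i) \Big] \Big\} \qquad (28)$$

and

2. In the case of non-convex loss, pick $\hat y_t$ from the distribution

$$q_t = \operatorname*{argmin}_{q \in \Delta([-B, B])} \sup_{y_t \in \mathcal{Y}} \Big\{ \mathbb{E}_{\hat y \sim q} [\ell(\hat y, y_t)] + \sup_{f \in \mathcal{F}} \Big[ C \sum_{i=t+1}^{T} \epsilon_i \ell(f(x_i), y_i) - \sum_{i=1}^{t} \ell(f(x_i), y_i) \Big] \Big\} \qquad (29)$$

The expected regret bound of the method (in both cases) is

$$\mathbb{E}[\mathbf{Reg}_T] \le C\, \mathbb{E}_{(x_1, y_1), \ldots, (x_T, y_T) \sim D}\, \mathbb{E}_{\epsilon} \Big[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \ell(f(x_t), y_t) \Big]$$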



7.4 Random Walks with Trees 



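We can also define randomized algorithms without the assumption that the classical and the sequential
Rademacher complexities are close. Instead, we assume that we have black-box access to a procedure that
on round $t$ returns the "worst-case" tree $\mathbf{x}^*$ of depth $T - t$.

Lemma 18. Given any $x_1, \ldots, x_{t-1}$ let

$$\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}} \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}} \Big[ 2 \sum_{i=1}^{T-t} \epsilon_i \ell(f, \mathbf{x}_i(\epsilon)) - \sum_{i=1}^{t-1} \ell(f, x_i) \Big] \qquad (30)$$

Consider the randomized strategy where at round $t$ we first draw $\epsilon_{t+1}, \ldots, \epsilon_T$ uniformly at random and then
further draw our move $f_t$ according to the distribution

$$q_t(\epsilon) = \operatorname*{argmin}_{q \in \Delta(\mathcal{F})} \sup_{x_t} \Big\{ \mathbb{E}_{f_t \sim q} [\ell(f_t, x_t)] + \sup_{f \in \mathcal{F}} \Big[ 2 \sum_{i=t+1}^{T} \epsilon_i \ell(f, \mathbf{x}^*_{i-t}(\epsilon)) - \sum_{i=1}^{t} \ell(f, x_i) \Big] \Big\} \qquad (31)$$

The expected regret of this randomized strategy is bounded by the sequential Rademacher complexity:

$$\mathbb{E}[\mathbf{Reg}_T] \le \mathfrak{R}_T(\mathcal{F}).$$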

Thus, if for any given history $x_1, \ldots, x_{t-1}$ we can compute $\mathbf{x}^*$ in (30), or even just draw directly a random path
$\mathbf{x}^*_1(\epsilon), \ldots, \mathbf{x}^*_{T-t}(\epsilon)$ on each round, then we obtain a randomized strategy that in expectation can guarantee
a regret bound equal to the sequential Rademacher complexity. Also notice that whenever the optimal strategy
in (31) is deterministic (e.g. in the online convex optimization scenario), one does not need the double
randomization. Instead, in such situations one can directly draw $\epsilon_1, \ldots, \epsilon_{T-t}$ and use

$$f_t(\epsilon) = \operatorname*{argmin}_{f_t \in \mathcal{F}} \sup_{x_t} \Big\{ \ell(f_t, x_t) + \sup_{f \in \mathcal{F}} \Big( 2 \sum_{i=1}^{T-t} \epsilon_i \ell(f, \mathbf{x}^*_i(\epsilon)) - \sum_{i=1}^{t} \ell(f, x_i) \Big) \Big\}$$

8 Static Experts with Convex Losses and Transductive Online Learning

We show how to recover a variant of the forecaster of [8], for static experts and transductive online
learning. At each round, the learner makes a prediction $q_t \in [-1, 1]$, observes the outcome $y_t \in [-1, 1]$, and
suffers convex $L$-Lipschitz loss $\ell(q_t, y_t)$. Regret is defined as the difference between the learner's cumulative loss
and $\inf_{f \in F} \sum_{t=1}^{T} \ell(f[t], y_t)$, where $F \subset [-1, 1]^T$ can be seen as a set of static experts. The transductive setting
is equivalent to this: the sequence of $x_t$'s is known before the game starts, and hence the effective function
class is once again a subset of $[-1, 1]^T$.

It turns out that in the static experts case, sequential Rademacher complexity boils down to the classical
Rademacher complexity (see [ ]), and thus the relaxation in (15) can be taken to be the classical, rather
than sequential, Rademacher averages. This is also the reason that an efficient implementation by sampling
is possible. Furthermore, for the absolute loss, the factor of 2 that appears in the sequential Rademacher
complexity is not needed. For general convex loss, one possible relaxation is just a conditional version of the
classical Rademacher averages:

$$\mathbf{Rel}_T(F \mid y_1, \ldots, y_t) = \mathbb{E}_{\epsilon_{t+1:T}} \sup_{f \in F} \Big[ 2L \sum_{s=t+1}^{T} \epsilon_s f[s] - L_t(f) \Big] \qquad (32)$$

where $L_t(f) = \sum_{s=1}^{t} \ell(f[s], y_s)$. This relaxation can be shown to be admissible.

First, consider the case of absolute loss $\ell(q_t, y_t) = |q_t - y_t|$ and binary-valued outcomes $y_t \in \{\pm 1\}$. In this case,
the solution in (15) yields the algorithm

$$q_t = \frac{1}{2}\, \mathbb{E}_{\epsilon} \Big[ \sup_{f \in F} \Big( \sum_{s=t+1}^{T} \epsilon_s f[s] - L_{t-1}(f) + f[t] \Big) - \sup_{f \in F} \Big( \sum_{s=t+1}^{T} \epsilon_s f[s] - L_{t-1}(f) - f[t] \Big) \Big]$$

which corresponds to the well-known minimax optimal forecaster for static experts with absolute loss [7].
Plugging in this value of $q_t$ into Eq. (14) proves admissibility, and thus the regret guarantee of this method
is equal to the classical Rademacher complexity.
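
For a small finite set of static experts and a short horizon, a forecaster of this kind can be evaluated exactly by enumerating all sign sequences. The Python sketch below assumes the form $q_t = \tfrac{1}{2}\mathbb{E}_{\epsilon}[\sup_f(\cdot + f[t]) - \sup_f(\cdot - f[t])]$ obtained from the equalizing solution (15) with the factor-of-2-free relaxation; the name `minimax_forecast` and the list-based representation of experts are hypothetical.

```python
from itertools import product

def minimax_forecast(experts, past_y):
    """Exact forecast q_t for static experts under absolute loss (sketch).

    experts: list of length-T sequences with entries in [-1, 1].
    past_y:  observed labels y_1..y_{t-1} in {-1, +1}; the current round
             has zero-based index t = len(past_y).
    """
    T = len(experts[0])
    t = len(past_y)
    # Cumulative absolute loss of each expert on the past rounds.
    past_loss = [sum(abs(f[s] - y) for s, y in enumerate(past_y)) for f in experts]
    total, count = 0.0, 0
    for eps in product((-1, 1), repeat=T - t - 1):   # all future sign sequences
        future = [sum(e * f[t + 1 + k] for k, e in enumerate(eps)) for f in experts]
        up = max(fu - pl + f[t] for fu, pl, f in zip(future, past_loss, experts))
        down = max(fu - pl - f[t] for fu, pl, f in zip(future, past_loss, experts))
        total += up - down
        count += 1
    return 0.5 * total / count
```

Two sanity checks follow from the formula: with a single expert the forecast equals that expert's prediction at round $t$, and with two opposite constant experts and no history the forecast is 0 by symmetry. The enumeration costs $2^{T-t-1}$ evaluations, which is why sampling is used in practice.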

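We now derive two variants of the forecaster for the more general case of $L$-Lipschitz loss and $y_t \in [-1, 1]$.

First Alternative: If (32) is used as a relaxation, the calculation of prediction $\hat y_t$ involves a supremum
over $f \in F$ with (potentially nonlinear) loss functions of instances seen so far. In some cases this optimization
might be hard and it might be preferable if the supremum only involves terms linear in $f$. This is the idea
behind the first method we present. To this end we start by noting that by convexity

$$\sum_{t=1}^{T} \ell(\hat y_t, y_t) - \inf_{f \in F} \sum_{t=1}^{T} \ell(f[t], y_t) \le \sum_{t=1}^{T} \partial \ell(\hat y_t, y_t) \cdot \hat y_t - \inf_{f \in F} \sum_{t=1}^{T} \partial \ell(\hat y_t, y_t) \cdot f[t]$$

Now given the above, one can consider an alternative online learning problem which, if we solve, also solves
the original problem. That is, consider the online learning problem with the new loss

$$\ell'(\hat y, r) = r \cdot \hat y$$

In this alternative game, we first pick prediction $\hat y_t$ (deterministically), next the adversary picks $r_t$
(corresponding to $r_t = \partial \ell(\hat y_t, y_t)$ for the choice of $y_t$ picked by the adversary). Now note that $\ell'$ is indeed convex in its
first argument and is $L$-Lipschitz because $|\partial \ell(\hat y_t, y_t)| \le L$. This is a one-dimensional convex learning game
where we pick $\hat y_t$ and regret is given by

$$\mathbf{Reg}_T = \sum_{t=1}^{T} \partial \ell(\hat y_t, y_t) \cdot \hat y_t - \inf_{f \in F} \sum_{t=1}^{T} \partial \ell(\hat y_t, y_t) \cdot f[t]$$

One can consider the relaxation

$$\mathbf{Rel}_T\big(F \,\big|\, \partial \ell(\hat y_1, y_1), \ldots, \partial \ell(\hat y_t, y_t)\big) = \mathbb{E}_{\epsilon_{t+1:T}} \sup_{f \in F} \Big[ 2L \sum_{i=t+1}^{T} \epsilon_i f[i] - \sum_{i=1}^{t} \partial \ell(\hat y_i, y_i) \cdot f[i] \Big] \qquad (33)$$

as a linearized form of (32). At round $t$, the prediction of the algorithm is then

$$\hat y_t = \mathbb{E}_{\epsilon} \Big[ \sup_{f \in F} \Big\{ \sum_{i=t+1}^{T} \epsilon_i f[i] - \frac{1}{2L} \sum_{i=1}^{t-1} \partial \ell(\hat y_i, y_i) f[i] + \frac{1}{2} f[t] \Big\} - \sup_{f \in F} \Big\{ \sum_{i=t+1}^{T} \epsilon_i f[i] - \frac{1}{2L} \sum_{i=1}^{t-1} \partial \ell(\hat y_i, y_i) f[i] - \frac{1}{2} f[t] \Big\} \Big] \qquad (34)$$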



Lemma 19. The relaxation in Equation (33) is admissible with respect to the prediction strategy specified
in Equation (34). Further, the regret of the strategy is bounded as

$$\mathbf{Reg}_T \le 2L\, \mathbb{E}_{\epsilon} \Big[ \sup_{f \in F} \sum_{t=1}^{T} \epsilon_t f[t] \Big]$$

The presented algorithm is similar in principle to [ ], with the main difference that the latter computes the infima
over a sum of absolute losses, while here we have a more manageable linearized objective. Note that while
we need to evaluate the expectation over the $\epsilon$'s on each round, we can estimate $\hat y_t$ by sampling $\epsilon$'s and using
McDiarmid's inequality to argue that, with enough draws, our estimate is close to $\hat y_t$ with high probability.
Interestingly, we can develop a randomized method that only draws one sequence of $\epsilon$'s per step, as
shown next.
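
The sampling estimate mentioned above can be sketched as a plain Monte Carlo average. The Python code below is an illustration under assumptions made here: the expert set is finite, the prediction has the linearized form with coefficients $1/(2L)$ on the past (sub)gradients and $1/2$ on the current coordinate, and the name `linearized_forecast` together with its parameters (`past_grads`, `n_samples`) is hypothetical.

```python
import random

def linearized_forecast(experts, past_grads, L, n_samples=1000, rng=random):
    """Monte Carlo estimate of the linearized prediction (illustrative sketch).

    experts:    list of length-T sequences with entries in [-1, 1].
    past_grads: (sub)gradients of the loss at the past predictions, one per
                past round; the current round is index t = len(past_grads).
    L:          Lipschitz constant of the loss.
    """
    T = len(experts[0])
    t = len(past_grads)
    # Linearized past loss of each expert, scaled by 1 / (2L).
    lin = [sum(g * f[i] for i, g in enumerate(past_grads)) / (2.0 * L)
           for f in experts]
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.choice((-1, 1)) for _ in range(T - t - 1)]
        base = [sum(e * f[t + 1 + k] for k, e in enumerate(eps)) - l
                for f, l in zip(experts, lin)]
        up = max(b + 0.5 * f[t] for b, f in zip(base, experts))
        down = max(b - 0.5 * f[t] for b, f in zip(base, experts))
        total += up - down
    return total / n_samples
```

Each sample requires only two maximizations of an objective that is linear in the expert's coordinates, which is the computational advantage of the linearized relaxation over the non-linearized one.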



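Second Alternative: Consider the non-linearized relaxation

$$\mathbf{Rel}_T(F \mid y_1, \ldots, y_t) = \mathbb{E}_{\epsilon} \Big[ \sup_{f \in F} \Big\{ 2L \sum_{i=t+1}^{T} \epsilon_i f[i] - \sum_{i=1}^{t} \ell(f[i], y_i) \Big\} \Big] \qquad (35)$$

already given in (32). We now present a randomized method based on the ideas of Section 7: at round $t$ we
first draw $\epsilon_{t+1}, \ldots, \epsilon_T$ and predict

$$\hat y_t(\epsilon) = \operatorname*{argmin}_{\hat y} \sup_{y_t} \Big\{ \ell(\hat y, y_t) + \sup_{f \in F} \Big\{ 2L \sum_{i=t+1}^{T} \epsilon_i f[i] - \sum_{i=1}^{t} \ell(f[i], y_i) \Big\} \Big\} \qquad (36)$$

We show that this predictor enjoys, in expectation, a regret bound given by the transductive Rademacher complexity.
More specifically, we have the following lemma.

Lemma 20. The relaxation specified in Equation (35) is admissible w.r.t. the randomized prediction strategy
specified in Equation (36). Further, the expected regret of the randomized strategy is bounded as

$$\mathbb{E}[\mathbf{Reg}_T] \le 2L\, \mathbb{E}_{\epsilon} \Big[ \sup_{f \in F} \sum_{t=1}^{T} \epsilon_t f[t] \Big]$$

In the next section, we employ both alternatives to develop novel algorithms for matrix completion.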



9 Matrix Completion 

Consider the problem of predicting unknown entries in a matrix (as in collaborative filtering). We focus
here on an online formulation, where at each round $t$ the adversary picks an entry in an $m \times n$ matrix and a
value $y_t$ for that entry (we shall assume without loss of generality that $n \ge m$). The learner then chooses a
predicted value $\hat{y}_t$, and suffers loss $\ell(\hat{y}_t, y_t)$, which we shall assume to be $\rho$-Lipschitz. We define our regret
with respect to the class $\mathcal{F}$, which we will take to be the set of all matrices whose trace-norm is at most $B$
(namely, we can use any such matrix to predict just by returning its relevant entry at each round). Usually,
one sets $B$ to be on the order of $\sqrt{mn}$.

We consider here a transductive version, where the sequence of entry locations is known in advance, and
only the entry values are unknown. We show how to develop an algorithm whose regret is bounded by the
(transductive) Rademacher complexity of $\mathcal{F}$. We note that in Theorem 6 of [17], this complexity was shown
to be at most of order $B\sqrt{n}$, independent of $T$. Moreover, in [8], it was shown that for algorithms with such
guarantees, and whose play each round does not depend on the order of future entries, under mild conditions
on the loss function one can get the same regret even in the "fully" online case where the set of entry locations
is unknown in advance. Algorithmically, all we need to do is pretend we are in a transductive game where
the sequence of entries is all $m \times n$ entries, in some arbitrary order. In this section we use the two alternatives
provided for the transductive learning problem in the previous subsection and provide two alternatives for the
matrix completion problem.

We note that both variants proposed here improve on the one provided by the forecaster in [b], since
that algorithm competes against the smaller class $\mathcal{F}'$ of matrices with bounded trace-norm and bounded
individual entries. In contrast, our algorithm provides similar regret guarantees against the larger class of
matrices for which only the trace-norm is bounded. Moreover, the variants are also computationally more efficient.



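First Alternative: The algorithm we now present is obtained by using the first method for online
transductive learning proposed in the previous section. The relaxation in Equation (33) for the specific
problem at hand is given by

$$\mathbf{Rel}_T(\mathcal{F}\mid y_1,\ldots,y_t) = B\,\mathbb{E}_\epsilon\left[\left\|2\rho\sum_{i=t+1}^{T}\epsilon_i x_i - \sum_{i=1}^{t}\partial\ell(\hat{y}_i,y_i)\,x_i\right\|_\sigma\right] \qquad (37)$$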


In the above, $\|\cdot\|_\sigma$ stands for the spectral norm and each $x_i$ is a matrix with a 1 at some specific position
and 0 elsewhere. That is, $x_i$ at round $i$ can be seen as the entry of the matrix which we are asked to fill in
at round $i$. The prediction at round $t$ returned by the algorithm is given by Equation (34), which for this
problem is given by



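$$\hat{y}_t = B\,\mathbb{E}_\epsilon\left[\left\|\sum_{i=t+1}^{T}\epsilon_i x_i - \frac{1}{2\rho}\sum_{i=1}^{t-1}\partial\ell(\hat{y}_i,y_i)\,x_i + \frac{1}{2}x_t\right\|_\sigma - \left\|\sum_{i=t+1}^{T}\epsilon_i x_i - \frac{1}{2\rho}\sum_{i=1}^{t-1}\partial\ell(\hat{y}_i,y_i)\,x_i - \frac{1}{2}x_t\right\|_\sigma\right]$$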



Notice that the algorithm only involves calculation of spectral norms on each round which can be done 
efficiently. Again as mentioned in previous subsection, one can evaluate the expectation over random signs 
by sampling e's on each round. 
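
As a rough illustration of the two ingredients just mentioned (spectral norms and sampling of random signs), here is a minimal standard-library sketch. It assumes the prediction has the form $B\,\mathbb{E}_\epsilon[\|M + \tfrac12 x_t\|_\sigma - \|M - \tfrac12 x_t\|_\sigma]$ with $M = \sum_{i>t}\epsilon_i x_i - \frac{1}{2\rho}\sum_{i<t}\partial\ell(\hat{y}_i,y_i)x_i$; the function names, argument layout, and the use of power iteration are choices made for this example, not taken from the source.

```python
import math
import random


def spectral_norm(matrix, iters=200):
    """Largest singular value via power iteration on M^T M.

    Uses a generic start vector; a start vector exactly orthogonal to the
    top singular direction would need a restart (not handled here).
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 + 0.01 * j for j in range(cols)]
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


def predict_entry(B, rho, shape, past, future_positions, current,
                  n_draws=50, seed=0):
    """Sampled estimate of B * E_eps[ ||M + x_t/2|| - ||M - x_t/2|| ].

    past: list of ((row, col), loss_derivative) for rounds before t.
    future_positions: (row, col) positions of rounds after t.
    current: (row, col) position of round t.
    """
    rng = random.Random(seed)
    m, n = shape
    base = [[0.0] * n for _ in range(m)]
    for (r, c), grad in past:
        base[r][c] -= grad / (2.0 * rho)
    total = 0.0
    for _ in range(n_draws):
        plus = [row[:] for row in base]
        for (r, c) in future_positions:
            plus[r][c] += rng.choice((-1.0, 1.0))  # random sign on a future entry
        minus = [row[:] for row in plus]
        plus[current[0]][current[1]] += 0.5
        minus[current[0]][current[1]] -= 0.5
        total += spectral_norm(plus) - spectral_norm(minus)
    return B * total / n_draws
```

With no history and no future entries the two norms coincide and the sketch predicts 0; a negative past derivative at the current entry pushes the prediction up, as one would expect.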



Second Alternative: The second algorithm is obtained from the second alternative for online transductive
learning with convex losses in the previous section. The relaxation given in Equation (35) for the
case of the matrix completion problem with trace norm constraint is given by:

$$\mathbf{Rel}_T(\mathcal{F}\mid y_1,\ldots,y_t) = \mathbb{E}_\epsilon\left[\sup_{f:\|f\|_\Sigma\le B}\; 2\rho\sum_{i=t+1}^{T}\epsilon_i\langle f,x_i\rangle - \sum_{i=1}^{t}\ell(\langle f,x_i\rangle,y_i)\right]$$



where $\|\cdot\|_\Sigma$ stands for the trace norm of the $m \times n$ matrix $f$ and each $x_i$ is a matrix with a 1 at some specific
position and 0 elsewhere. That is, $x_i$ at round $i$ can be seen as the entry of the matrix which we are asked
to fill in at round $i$. We use $\langle f, x\rangle$ to represent the generalized inner product of the two matrices. Since we
only take inner products with respect to the matrices $x_i$, each $\langle f, x_i\rangle$ is simply the value of matrix $f$ at the
position specified by $x_i$. The prediction at a matrix entry corresponding to position $x_t$ is given by first
drawing random $\{\pm 1\}$-valued $\epsilon$'s and then applying Equation (36) to the problem at hand, yielding



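$$\hat{y}_t = \inf_{\|f\|_\Sigma\le B}\left\{-\sum_{i=t+1}^{T}\epsilon_i\langle f,x_i\rangle + \frac{1}{2\rho}\sum_{i=1}^{t-1}\ell(\langle f,x_i\rangle,y_i) + \frac{1}{2}\langle f,x_t\rangle\right\} - \inf_{\|f\|_\Sigma\le B}\left\{-\sum_{i=t+1}^{T}\epsilon_i\langle f,x_i\rangle + \frac{1}{2\rho}\sum_{i=1}^{t-1}\ell(\langle f,x_i\rangle,y_i) - \frac{1}{2}\langle f,x_t\rangle\right\}$$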

Notice that the above involves solving two trace norm constrained convex optimization problems per round. 
As a simple corollary of Lemma 20 we get the following bound on expected regret of the algorithm. 

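Corollary 21. For the randomized prediction strategy specified above, the expected regret is bounded as

$$\mathbb{E}[\mathbf{Reg}_T] \le 2B\rho\,\mathbb{E}_\epsilon\left[\left\|\sum_{t=1}^{T}\epsilon_t x_t\right\|_\sigma\right] \le O\left(B\rho\,(\sqrt{m}+\sqrt{n})\right).$$

The last inequality in the above corollary uses Theorem 6 in [17].

Corollary 22. For the predictions $\hat{y}_t$ specified above, the regret is bounded as

$$\mathbf{Reg}_T \le O\left(B\rho\,(\sqrt{m}+\sqrt{n})\right).$$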



10 More Examples 



10.1 Constrained Adversaries 



We now show that algorithms can also be developed for situations when the adversary is constrained in the
choices per step. Such constrained problems have been treated in a general non-algorithmic way in [10], and
we picked the case of a variation-constrained adversary for illustration. It is shown in [16] that the value of
the game where the adversary is constrained to keep the next move $x_t$ within $\sigma_t$ from the average of the
past moves $\frac{1}{t-1}\sum_{s=1}^{t-1}x_s$ is upper bounded as



Vt < 2 sup Ee 

(x,x')er 



sup 



^e,((/,x,(e))--i-g(/,x.M)] 



(38) 



where the supremum is over trees $\mathbf{x}, \mathbf{x}'$ satisfying the above-mentioned constraint per step, and the selector
$\chi_t(\epsilon_t)$ is defined as $\mathbf{x}_t(\epsilon)$ if $\epsilon_t = -1$ and $\mathbf{x}'_t(\epsilon)$ otherwise. In our algorithmic framework, this leads to the
following problem that needs to be solved at each step:



inf sup] (/t,a;t) + 2 sup E^ 

ft Xt I (x,x')eT 



sup I /, 



E ^si^si^)-^- E Xrier)] - Xr] 
s=t+l \ ^ T=t+1 / r-1 /. 



where the supremum is taken over xt such that the constraint C(xi, . . . ,xt) is satisfied and T is the set of 
trees that satisfy the constraints as continuation of the prefix xi, . . . ,xt. While this expression gives rise to 
an algorithm, we are aiming for a more computationally feasible method. In fact, passing to an upper bound 
on the sequential Rademacher complexity yields the following result. 

Lemma 23. The following relaxation is admissible and upper bounds the constrained sequential complexity 



RgIt {T\xi,. ..,xt) 



2V2R 



t 

r=l 



c E 



Furthermore, an admissible algorithm for this relaxation is Mirror Descent with a step size given at time
t>2 by



Xt-l

10.2 Universal Mirror Descent



In [18] it is shown that for the problem of general online convex optimization, the Mirror Descent algorithm
is universal and near optimal (up to poly-log factors). Specifically, it is shown that there always exists an
appropriate function $\Psi$ such that the Mirror Descent algorithm using this function, along with an appropriate
step size, gives the near optimal rate. Moreover, it is shown in [18] that one can use a function $\Psi$ whose convex
conjugate is given by



'^*{x) = sup Eg 



T-t 



■c5E.[l|x.(e)r] 



(39) 



as the "universal regularizer" for the Mirror Descent algorithm. We now show that this function arises 
rather naturally from the sequential Rademacher relaxation and, moreover, the Mirror Descent algorithm 
itself arises from this relaxation. 

Let us denote the convex cost functions chosen by the adversary as $\ell_t$, and let $x_t$ be the subgradients
$x_t = \nabla\ell_t(f_t)$ of the convex functions.



Lemma 24. The relaxation 



RelT {T\xi, ...,xt) = |g a;, j + |v** |e a;, j + C{T-t + 1) j 



i/p 



is an upper bound on the conditional sequential Rademacher complexity. Further, whenever for some p' > p 
we have that Vt{^) ^ {CT^Ip , then the relaxation is admissible and leads to a form of Mirror Descent 
algorithm with regret bounded as 



It is remarkable that the universal regularizer and the Mirror Descent algorithm arise naturally, in a few
steps of algebra, as upper bounds on the sequential Rademacher complexity.

A PROOFS



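Proof of Proposition 1. By definition,

$$\sum_{t=1}^{T}\mathbb{E}_{f_t\sim q_t}\ell(f_t,x_t) - \inf_{f\in\mathcal{F}}\sum_{t=1}^{T}\ell(f,x_t) \le \sum_{t=1}^{T}\mathbb{E}_{f_t\sim q_t}\ell(f_t,x_t) + \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_T).$$

Peeling off the $T$-th expected loss, we have

$$\sum_{t=1}^{T}\mathbb{E}_{f_t\sim q_t}\ell(f_t,x_t) + \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_T) = \sum_{t=1}^{T-1}\mathbb{E}_{f_t\sim q_t}\ell(f_t,x_t) + \left\{\mathbb{E}_{f_T\sim q_T}\ell(f_T,x_T) + \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_T)\right\}$$

$$\le \sum_{t=1}^{T-1}\mathbb{E}_{f_t\sim q_t}\ell(f_t,x_t) + \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_{T-1}),$$

where we used the fact that $q_T$ is an admissible algorithm for this relaxation, and thus the last inequality
holds for any choice $x_T$ of the opponent. Repeating the process, we obtain

$$\sum_{t=1}^{T}\mathbb{E}_{f_t\sim q_t}\ell(f_t,x_t) - \inf_{f\in\mathcal{F}}\sum_{t=1}^{T}\ell(f,x_t) \le \mathbf{Rel}_T(\mathcal{F}).$$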

We remark that the left-hand side of this inequality is random, while the right-hand side is not. Since the 
inequality holds for any realization of the process, it also holds in expectation. The inequality 

holds by unwinding the value recursively and using admissibility of the relaxation. The high-probability
bound is an immediate consequence of (5) and the Hoeffding-Azuma inequality for bounded martingales.
The last statement is immediate. □



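Proof of Proposition 2. Denote $L_t(f) = \sum_{s=1}^{t}\ell(f,x_s)$. The first step of the proof is an application of the
minimax theorem (we assume the necessary conditions hold):

$$\inf_{q_t}\sup_{x_t}\left\{\mathbb{E}_{f_t\sim q_t}[\ell(f_t,x_t)] + \sup_{\mathbf{x}}\mathbb{E}_{\epsilon_{t+1:T}}\sup_{f\in\mathcal{F}}\left[2\sum_{s=t+1}^{T}\epsilon_s\ell(f,\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - L_t(f)\right]\right\}$$

$$= \sup_{p_t}\inf_{f_t}\left\{\mathbb{E}_{x_t\sim p_t}[\ell(f_t,x_t)] + \mathbb{E}_{x_t\sim p_t}\sup_{\mathbf{x}}\mathbb{E}_{\epsilon_{t+1:T}}\sup_{f\in\mathcal{F}}\left[2\sum_{s=t+1}^{T}\epsilon_s\ell(f,\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - L_t(f)\right]\right\}$$

For any $p_t \in \Delta(\mathcal{X})$, the infimum over $f_t$ of the above expression is equal to

$$\mathbb{E}_{x_t\sim p_t}\sup_{\mathbf{x}}\mathbb{E}_{\epsilon_{t+1:T}}\sup_{f\in\mathcal{F}}\left[2\sum_{s=t+1}^{T}\epsilon_s\ell(f,\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - L_{t-1}(f) + \inf_{f_t}\mathbb{E}_{x_t\sim p_t}[\ell(f_t,x_t)] - \ell(f,x_t)\right]$$

$$\le \mathbb{E}_{x_t\sim p_t}\sup_{\mathbf{x}}\mathbb{E}_{\epsilon_{t+1:T}}\sup_{f\in\mathcal{F}}\left[2\sum_{s=t+1}^{T}\epsilon_s\ell(f,\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - L_{t-1}(f) + \mathbb{E}_{x_t\sim p_t}[\ell(f,x_t)] - \ell(f,x_t)\right]$$

$$\le \mathbb{E}_{x_t,x'_t\sim p_t}\sup_{\mathbf{x}}\mathbb{E}_{\epsilon_{t+1:T}}\sup_{f\in\mathcal{F}}\left[2\sum_{s=t+1}^{T}\epsilon_s\ell(f,\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - L_{t-1}(f) + \ell(f,x'_t) - \ell(f,x_t)\right]$$

We now argue that the independent $x_t$ and $x'_t$ have the same distribution $p_t$, and thus we can introduce a
random sign $\epsilon_t$. The above expression then equals

$$\mathbb{E}_{x_t,x'_t\sim p_t}\mathbb{E}_{\epsilon_t}\sup_{\mathbf{x}}\mathbb{E}_{\epsilon_{t+1:T}}\sup_{f\in\mathcal{F}}\left[2\sum_{s=t+1}^{T}\epsilon_s\ell(f,\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - L_{t-1}(f) + \epsilon_t\left(\ell(f,x'_t) - \ell(f,x_t)\right)\right]$$

$$\le \sup_{x_t,x'_t}\mathbb{E}_{\epsilon_t}\sup_{\mathbf{x}}\mathbb{E}_{\epsilon_{t+1:T}}\sup_{f\in\mathcal{F}}\left[2\sum_{s=t+1}^{T}\epsilon_s\ell(f,\mathbf{x}_{s-t}(\epsilon_{t+1:s-1})) - L_{t-1}(f) + \epsilon_t\left(\ell(f,x'_t) - \ell(f,x_t)\right)\right]$$

where we upper bounded the expectation by the supremum. Splitting the resulting expression into two parts,
we arrive at the upper bound of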



2 sup EsupEjj^j.y sup 



The last equality is easy to verify, as we are effectively adding a root Xt to the two subtrees, for et = +1 and 
et = -1, respectively. 

One can see that the proof of admissibility corresponds to one step of minimax swap and symmetrization in
the proof of [ ]. In contrast, in the latter paper, all $T$ minimax swaps are performed at once, followed by
$T$ symmetrization steps. □

Proof of Proposition 3. Let us first prove that the relaxation is admissible with the Exponential Weights
algorithm as an admissible algorithm. Let $L_t(f) = \sum_{i=1}^{t}\ell(f,x_i)$. Let $\lambda^*$ be the optimal value in the definition
of $\mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_{t-1})$. Then

inf sup] E [e{f,Xt)] + RelT{T\xi,...,Xt) 

< inf sup E [£{f,xt)] + ^log\Y,exp{-X*Lt{f))]+2X*{T-t)\ 

Let us upper bound the infimum by a particular choice of $q$ which is the exponential weights distribution

$$q_t(f) = \exp(-\lambda^* L_{t-1}(f))/Z_{t-1},$$

where $Z_{t-1} = \sum_{f\in\mathcal{F}}\exp(-\lambda^* L_{t-1}(f))$. By [7, Lemma A.1],

^log|^^exp(-AUi(/))j = ^log(E/.,,exp(-A*^(/,xt))) + ^logZ^^i 



<-Ef^^J{f,Xt) + ^ + ^\ogZt_, 



Hence, 



inf sup{ E [£(/,xO]+RelT(^ki,...,Xt)|< -i^logf E exp(-A*L4_i(/))| + 2A*(T-t+l) 

= Rely {^\X1, . . .,Xt-l) 

by the optimality of A*. The bound can be improved by a factor of 2 for some loss functions, since it will 
disappear from the definition of sequential Rademacher complexity. 

We conclude that the Exponential Weights algorithm is an admissible strategy for the relaxation (8). 
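
For concreteness, the exponential weights distribution used in this argument, $q_t(f) \propto \exp(-\lambda L_{t-1}(f))$, can be computed as in the following minimal sketch for a finite class. The function name and the toy losses are hypothetical; subtracting the minimum loss before exponentiating is a standard numerical safeguard and does not change the distribution.

```python
import math


def exponential_weights(cumulative_losses, lam):
    """Return q with q(f) proportional to exp(-lam * L_{t-1}(f))."""
    shift = min(cumulative_losses)  # improves numerical stability only
    unnorm = [math.exp(-lam * (loss - shift)) for loss in cumulative_losses]
    z = sum(unnorm)  # plays the role of the normalizer Z_{t-1}
    return [u / z for u in unnorm]


# Toy cumulative losses of three experts (an assumption for illustration).
q = exponential_weights([3.0, 1.0, 2.0], lam=0.5)
```

Experts with smaller cumulative loss receive larger weight, and the weights sum to one.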



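Arriving at the relaxation. We now show that the Exponential Weights relaxation arises naturally as
an upper bound on sequential Rademacher complexity of a finite class. For any $\lambda > 0$,

$$\mathbb{E}_\epsilon\left[\sup_{f\in\mathcal{F}}\left\{2\sum_{i=1}^{T-t}\epsilon_i\ell(f,\mathbf{x}_i(\epsilon)) - L_t(f)\right\}\right] \le \frac{1}{\lambda}\log\left(\mathbb{E}_\epsilon\left[\sup_{f\in\mathcal{F}}\exp\left(2\lambda\sum_{i=1}^{T-t}\epsilon_i\ell(f,\mathbf{x}_i(\epsilon)) - \lambda L_t(f)\right)\right]\right)$$

$$\le \frac{1}{\lambda}\log\left(\mathbb{E}_\epsilon\left[\sum_{f\in\mathcal{F}}\exp\left(2\lambda\sum_{i=1}^{T-t}\epsilon_i\ell(f,\mathbf{x}_i(\epsilon)) - \lambda L_t(f)\right)\right]\right)$$

$$= \frac{1}{\lambda}\log\left(\sum_{f\in\mathcal{F}}\exp(-\lambda L_t(f))\,\mathbb{E}_\epsilon\left[\prod_{i=1}^{T-t}\exp\left(2\lambda\epsilon_i\ell(f,\mathbf{x}_i(\epsilon))\right)\right]\right)$$

We now upper bound the expectation over the "future" tree by the worst-case path, resulting in the upper
bound

$$\frac{1}{\lambda}\log\left(\sum_{f\in\mathcal{F}}\exp(-\lambda L_t(f))\times\exp\left(2\lambda^2\max_{\epsilon_1,\ldots,\epsilon_{T-t}\in\{\pm1\}}\sum_{i=1}^{T-t}\ell(f,\mathbf{x}_i(\epsilon))^2\right)\right)$$

$$\le \frac{1}{\lambda}\log\left(\sum_{f\in\mathcal{F}}\exp\left(-\lambda L_t(f) + 2\lambda^2\max_{\epsilon_1,\ldots,\epsilon_{T-t}\in\{\pm1\}}\sum_{i=1}^{T-t}\ell(f,\mathbf{x}_i(\epsilon))^2\right)\right)$$

$$\le \frac{1}{\lambda}\log\left(\sum_{f\in\mathcal{F}}\exp(-\lambda L_t(f))\right) + 2\lambda\sup_{\mathbf{x}}\sup_{f\in\mathcal{F}}\max_{\epsilon_1,\ldots,\epsilon_{T-t}\in\{\pm1\}}\sum_{i=1}^{T-t}\ell(f,\mathbf{x}_i(\epsilon))^2$$

The last term, representing the "worst future", is upper bounded by $2\lambda(T-t)$. This removes the $\mathbf{x}$ tree and
leads to the relaxation (8) and a computationally tractable algorithm. □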
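
To make the resulting quantity tangible, the sketch below evaluates $\frac{1}{\lambda}\log\sum_f\exp(-\lambda L_t(f)) + 2\lambda(T-t)$ for a finite class and checks the value at the start of the game, where all cumulative losses are zero. The function name is hypothetical, and the closed-form choice $\lambda = \sqrt{\log N/(2T)}$ is the minimizer of $\log N/\lambda + 2\lambda T$, worked out for this example.

```python
import math


def exp_weights_relaxation(cumulative_losses, rounds_left, lam):
    """(1/lam) * log(sum_f exp(-lam * L_t(f))) + 2 * lam * rounds_left."""
    m = max(-lam * loss for loss in cumulative_losses)  # log-sum-exp shift
    log_sum = m + math.log(
        sum(math.exp(-lam * loss - m) for loss in cumulative_losses)
    )
    return log_sum / lam + 2.0 * lam * rounds_left


n_experts, horizon = 8, 100
# Minimizer of log(N)/lam + 2*lam*T, valid when all losses are zero.
lam_star = math.sqrt(math.log(n_experts) / (2.0 * horizon))
value = exp_weights_relaxation([0.0] * n_experts, horizon, lam_star)
```

At $t = 0$ the value equals $2\sqrt{2T\log N}$, the familiar order of the regret bound for a finite class.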

Proof of Proposition 4. The argument can be seen as a generalization of the Euclidean proof in [-] to
general smooth norms. The proof below not only shows that the Mirror Descent algorithm is admissible for
the relaxation (9), but in fact shows that it coincides with the optimal algorithm for the relaxation, i.e. the
one that attains the infimum over strategies.

Let $\tilde{x}_{t-1} = \sum_{i=1}^{t-1}x_i$. The optimal algorithm for the relaxation (9) is

$$f_t = \arg\min_{f}\sup_{x_t}\left\{\langle f,x_t\rangle + \left(\|\tilde{x}_{t-1}\|^2 + \langle\nabla\|\tilde{x}_{t-1}\|^2, x_t\rangle + C(T-t+1)\right)^{1/2}\right\}$$

Now write any $f_t$ as $f_t = -\alpha\nabla\|\tilde{x}_{t-1}\|^2 + g$ for some $g \in \mathrm{Kernel}(\nabla\|\tilde{x}_{t-1}\|^2) = \{h : \langle\nabla\|\tilde{x}_{t-1}\|^2, h\rangle = 0\}$, and any $x_t$
as $x_t = \beta\tilde{x}_{t-1} + \gamma y$ for some $y \in \mathrm{Kernel}(\nabla\|\tilde{x}_{t-1}\|^2)$. Hence we can write:

$$\langle f_t,x_t\rangle + \left(\|\tilde{x}_{t-1}\|^2 + \langle\nabla\|\tilde{x}_{t-1}\|^2, x_t\rangle + C(T-t+1)\right)^{1/2}$$

$$= -\alpha\beta\|\tilde{x}_{t-1}\|^2 + \gamma\langle g,y\rangle + \left(\|\tilde{x}_{t-1}\|^2 + \beta\|\tilde{x}_{t-1}\|^2 + C(T-t+1)\right)^{1/2} \qquad (40)$$



Given any $f_t = -\alpha\nabla\|\tilde{x}_{t-1}\|^2 + g$, $x_t$ can be picked with $y \in \mathrm{Kernel}(\nabla\|\tilde{x}_{t-1}\|^2)$ that satisfies $\langle g,y\rangle \ge 0$. One can
always do this because if for some $y'$, $\langle g,y'\rangle < 0$, by picking $y = -y'$ we can ensure that $\langle g,y\rangle \ge 0$. Hence the
minimizer $f_t$ must be one such that $f_t = -\alpha\nabla\|\tilde{x}_{t-1}\|^2$ and thus $\langle g,y\rangle = 0$. Now, it must be that $\alpha \ge 0$ so
that $x_t$ either increases the first term or the second term but not both. Hence we have that $f_t = -\alpha\nabla\|\tilde{x}_{t-1}\|^2$ for
some $\alpha \ge 0$. Now given such an $f_t$, the sup over $x_t$ can be written as a supremum over $\beta$ of a concave function,
which gives rise to the derivative condition

$$-\alpha\|\tilde{x}_{t-1}\|^2 + \frac{\|\tilde{x}_{t-1}\|^2}{2\sqrt{\|\tilde{x}_{t-1}\|^2 + \beta\|\tilde{x}_{t-1}\|^2 + C(T-t+1)}} = 0$$

At this point it is clear that the value of

$$\alpha = \frac{1}{2\sqrt{\|\tilde{x}_{t-1}\|^2 + C(T-t+1)}} \qquad (41)$$

forces $\beta = 0$. Let us in fact show that this value is optimal. We have

$$\frac{1}{4\alpha^2} = \|\tilde{x}_{t-1}\|^2 + \beta\|\tilde{x}_{t-1}\|^2 + C(T-t+1)$$

Plugging this value of $\beta$ back, we now aim to optimize

$$\frac{1}{4\alpha} + \alpha\|\tilde{x}_{t-1}\|^2 + \alpha C(T-t+1)$$

over $\alpha$. We then obtain the value given in (41). With this value, we have the familiar update

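$$f_t = -\frac{\nabla\|\tilde{x}_{t-1}\|^2}{2\sqrt{\|\tilde{x}_{t-1}\|^2 + C(T-t+1)}} \qquad (42)$$

Plugging back the value of $\alpha$, we find that $\beta = 0$. With these values,

$$\inf_{f}\sup_{x}\left\{\langle f,x\rangle + \left(\|\tilde{x}_{t-1}\|^2 + \langle\nabla\|\tilde{x}_{t-1}\|^2, x\rangle + C(T-t+1)\right)^{1/2}\right\} = \left(\|\tilde{x}_{t-1}\|^2 + C(T-t+1)\right)^{1/2}$$

$$\le \left(\|\tilde{x}_{t-2}\|^2 + \langle\nabla\|\tilde{x}_{t-2}\|^2, x_{t-1}\rangle + C(T-t+2)\right)^{1/2} = \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_{t-1})$$

We have shown that (42) is an optimal algorithm for the relaxation, and it is admissible.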
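
As an illustration of the update (42), consider the Euclidean norm, for which $\nabla\|x\|^2 = 2x$ and the smoothness constant can be taken as $C = 1$; the update then reads $f_t = -\tilde{x}_{t-1}/\sqrt{\|\tilde{x}_{t-1}\|^2 + C(T-t+1)}$. The sketch below is a specialization made for this example only; the function name and argument layout are hypothetical.

```python
import math


def md_prediction(past_moves, dim, horizon, t, smoothness=1.0):
    """Euclidean instance of the update: -x_sum / sqrt(|x_sum|^2 + C (T - t + 1))."""
    x_sum = [sum(x[j] for x in past_moves) for j in range(dim)]
    sq_norm = sum(v * v for v in x_sum)
    denom = math.sqrt(sq_norm + smoothness * (horizon - t + 1))
    return [-v / denom for v in x_sum]


# One past move (3, 4), horizon T = 10, current round t = 2.
f_t = md_prediction([[3.0, 4.0]], dim=2, horizon=10, t=2)
```

The prediction points against the sum of past moves and always lies strictly inside the unit ball, with a step that grows as the end of the game approaches.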

Arriving at the Relaxation The derivation of the relaxation is immediate: 

T t 
X s=t+l s=l 

T 

< sup " 



sup 




t 

s=l 



+ CEe,,,..r E ||e,x,_t(et+i.,_i)|| 



(43) 
(44) 
(45) 



where the last step is due to the smoothness of the norm and the fact that the first-order terms disappear 
under the expectation. The sum of norms is now upper bounded by T - 1, thus removing the dependence on 
the "future" , and we arrive at 



t 


to 




t-1 




t-1 






+ C{T-t) < ^ 










^xt^CiT-t+l) 


s=l 






s=l 




s=l 





\ 



as a relaxation on the sequential Rademacher complexity. 



□ 



Proof of Lemma 8. We shall first establish the admissibility of the relaxation specified. To show admis- 
sibility, let us first check the initial condition: 



Relfe {J^r{k;xu-,xt)\yi^ ■ • ■ = - {ft.Vk) + 2niin|l 

>-(/t,yfe) + 2niin|l 



k 
k 





fc-l 




fe-i 










Yvj 















Vkl 



^-{ft,yk)+ sup {f-ft,-yk) 

/:||/-/t||<2min{l,^} 
k 

^- „ „ inf E(/'%)

/-/t <2min{l,^}j = l
Now, for the recurrence, we have



{fi,yi) + supEe 



sup 



k-i i 



{ft,yt} - {ft,Y.yji + ^^^^^ 



^ (/»,y») - l/t: Zyj ) + 2min] 1, [supE, 

j=i I ( cri:t) y 



sup 

/:||/-/tl|£2min|l, 
k 



k-i 



k-i i 

3=1 i=i 



< (/., y.) - ( /t, E ) + 2 min ] 1, A| ,^^^^^p7c(A^ 



< {f^.V^) - [fum) + 2min{l, A| ^ \\y^f + (y \\y^-l\\\y^) + C{k-i+l) 

= {ft - ft.Vi) - {lt,m-i) + 2mm|l, ^jy^lli/i-if + (v ||y,-if ,Xi| + C(fc-i + 1) 

and we start the block at $f_t$. For the first block, this value is 0, but later on it is the empirical risk minimizer.
We therefore get a mixture of Follow the Leader (FTL) and Gradient Descent (GD) algorithms. If the block
size is 1, we get FTL only, and when the block size is $T$ we get GD only. In general, however, the resulting
method is an interesting mixture of the two. Using the arguments of Proposition 9, the update in the block
is given by



ft+i = ft- max \ 1 



CTl 



-V ^1 



Now that we have shown the admissibility of the relaxation and the form of the update obtained from it,
we turn to the bounds on the regret specified in the lemma. We shall provide these bounds using Lemma 7.
We split the analysis into two cases, one when $\alpha \ge 1/2$ and the other when $\alpha < 1/2$.

Case $\alpha \ge 1/2$:

To start, note that since we initialize the block lengths with the doubling trick, that is, as $1, 2, 4, \ldots$,
after $t$ rounds the length $k$ of the current block can be at most $2t$, and so
$\sqrt{k} \le \sqrt{2t}$. Now let us first consider the case when $\alpha \ge 1/2$. In this case, since $\sigma_{1:t} = Bt^{\alpha}$, we can conclude that
the condition $\sigma_{1:t} \ge \sqrt{k}$ is satisfied as long as $t^{\alpha-1/2} \ge \sqrt{2}/B$. Since we are considering the case when $\alpha \ge 1/2$, we
can conclude that for all rounds larger than $\sqrt{2}/B$, the blocking strategy always picks block size of 1. Hence,
applying Lemma 7, we conclude that in the case when $1 > \alpha > 1/2$ (or when $\alpha = 1/2$ and $B \ge \sqrt{2}$),

Reg^ < ^ ^ = J = 0(Ti-/S) 

t=l 0-1:4 4=1 -OI" 

Also note that for the case when a = 1, the summation is bounded by O(logT) and so 

Reg^ < f ^ = f ^ = 0(logT/i3) 

4=1 0-1:4 4=1 Bt"' 



Case $\alpha < 1/2$:

Now we consider the case when $\alpha < 1/2$. Say we are at the start of some block, $t = 2^m$. The initial block length
then is $2t$ by the doubling trick initialization. Now within this block, the adaptive algorithm continues with
the current block until the point when the square root of the remaining number of rounds in the block, say
$k$, becomes smaller than $\sigma_{1:t+(2t-k)}$. That is, until

$$\sqrt{k} \le B(3t-k)^{\alpha} \qquad (46)$$

The regret on this block can be bounded using Lemma 7 (notice that here we use the lemma for the algorithm
within a sub-block initialized by the doubling trick rather than on the entire $T$ rounds). The regret on this
block is bounded as:

2t 2t ^ 

Rel2t-k{^r{x^....,xt))+ E Reli {j'r{xi,...,x,)) < ^ "^t - k + ^ -— 

i=2t-fc+l j"=2t-fc+l 

2t 1 

<V2i+ £ 

j=2t-k+l -^J" 

<^/2i+^ ((2t + 1)1-" -{2t-k + 1)1-°) 

< V2t + 

B 

< x/2i + (using Eq. (46)) 

B 

<V2t + B^^^-°'^-^V3i 

< VT2t 

Hence the overall regret is bounded as

$$\mathbf{Reg}_T \le \sum_{i=1}^{\lceil\log_2 T\rceil+1}\sqrt{12\times 2^{i-1}} \le \sqrt{12}\sum_{i=1}^{\lceil\log_2 T\rceil+1}2^{(i-1)/2} \le O(\sqrt{T})$$

This concludes the proof. □

Proof of Lemma 9. Notice that by doubling trick for at most first 2r rounds we simply play the experts 
algorithm, thus suffering a maximum regret that is minimum of r and A^Jrlog After these initial number 
of rounds, consider any round t at which we start a new block with the blocking strategy described above. 
The first sub-block given by the blocking strategy is of length at most k, thanks to our assumption about 
the gap between the leader and the second-best action. Clearly the minimizer of cumulative loss up to the $t$
rounds already played, $\arg\min_{f\in\mathcal{F}}\sum_{i=1}^{t}\ell(f,x_i)$, is going to be the leader at least for the next $k$ rounds. Hence
for this block we suffer no regret. Now when we use the same blocking strategy repeatedly, due to the same
reasoning, we end up playing the same leader for the rest of the game, only in chunks of size $k$, and thus
suffer no regret for the rest of the game. □

Proof of Proposition 10. We would like to show that, with the distribution $q_t^*$ defined in (18),

$$\max_{y_t\in\{\pm1\}}\left\{\mathbb{E}_{\hat{y}_t\sim q_t^*}|\hat{y}_t - y_t| + \mathbf{Rel}_T(\mathcal{F}\mid(x^t,y^t))\right\} \le \mathbf{Rel}_T(\mathcal{F}\mid(x^{t-1},y^{t-1}))$$

for any $x_t \in \mathcal{X}$. Let $\sigma \in \{\pm1\}^{t-1}$ and $\sigma_t \in \{\pm1\}$. We have
RelT {T\{x\y'))-2X{T-t) 



^ \ogi ^ g{Ldim{Tt{<J,at)),T -t)exp{-XLt-i{a)}exp{-X\at-yt\} 



A 



<ylog( exp{-X\at-yt\} ^ 5(Ldim( J-^C^, a*)), T - t) exp {-Ait_i(a)} 



Just as in the proof of Proposition 3, we may think of the two choices $\sigma_t$ as the two experts whose weighting
$q_t^*$ is given by the sum involving the Littlestone's dimension of subsets of $\mathcal{F}$. Introducing the normalization
term, we arrive at the upper bound



^log(E,^.,.exp{-A|at-2/t|}) + ylog( ^ .9(Ldim( a*)), T - i) exp {-ALt_i(a)} 

^ ^ \at^{±l}a:{a,atyj^\^t J 



<-E,,.,.|at-yt| + 2A+^log( ^ ^ g{Ldim{Tt{a,at)),T - 1) exp{-XLt^i{a)} 



The last step is due to Lemma A.l in [7]. It remains to show that the log normalization term is upper 
bounded by the relaxation at the previous step: 



^logj Y E 9iLdimi:Ftia,at)),T-t)exp{-XLt-iia)} 

\crt6{±l} cr:((T,crf )e;P|^t J 



X 



<^log( Y exp{-ALt_i(a)} Y g{l^dim{Tt{a,at)),T-t) 



<|log( Y exp{-ALt_i(a)}g(Ldim(^t_i(a)),T-t+l)| 

To justify the last inequality, note that $\mathcal{F}_{t-1}(\sigma) = \mathcal{F}_t(\sigma,+1)\cup\mathcal{F}_t(\sigma,-1)$ and at most one of $\mathcal{F}_t(\sigma,+1)$ or
$\mathcal{F}_t(\sigma,-1)$ can have Littlestone's dimension $\mathrm{Ldim}(\mathcal{F}_{t-1}(\sigma))$. We now appeal to the recursion

$$g(d,T-t) + g(d-1,T-t) \le g(d,T-t+1)$$

where $g(d,T-t)$ is the size of the zero cover for a class with Littlestone's dimension $d$ on the worst-case tree
of depth $T-t$ (see [14]). This completes the proof of admissibility.
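
The recursion can be checked numerically under the assumption that the zero-cover size takes the Sauer-type form $g(d,n) = \sum_{i=0}^{d}\binom{n}{i}$. This closed form is an assumption of the sketch (the text above only refers to [14] for the definition), and the function name is hypothetical. With this form, Pascal's rule makes the recursion hold with equality.

```python
from math import comb


def zero_cover_bound(d, n):
    """Assumed form g(d, n) = sum_{i=0}^{d} C(n, i); g(d, n) = 0 for d < 0."""
    if d < 0:
        return 0
    # comb(n, i) is 0 whenever i > n, so no extra clipping is needed.
    return sum(comb(n, i) for i in range(d + 1))


# Check g(d, n) + g(d - 1, n) <= g(d, n + 1) on a small grid.
recursion_holds = all(
    zero_cover_bound(d, n) + zero_cover_bound(d - 1, n) <= zero_cover_bound(d, n + 1)
    for d in range(0, 7)
    for n in range(0, 11)
)
```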



Alternative Method Let us now derive the algorithm given in (19) and prove its admissibility. Once 
again, consider the optimization problem 

max j E \yt-yt\ + Iie\T{T\{x\y'))\ 

with the relaxation 

RelT(^|(a;*,y'))= ^logj ^ g{l.diTn{Tt{a)).T-t)exp{-XLt{a)}\ + \{T-t) 






The maximum can be written explicitly, as in Section 6: 



max]l-gt* + Ylogj ^ g(Ldim( J"t((T, cTf )), T - exp {-ALt_i(cr)} exp {-A(l - dt)} j , 

^ \(fT.CTt)e^Ut / 



l + gt+ylog| ^ g{Ldim{Tt{<y,crt)),T-t)exp{-XLt-i{a)}exp{-X{l + crt)} 



where we have dropped the ^(T - t) term from both sides. Equating the two values, we obtain 

_ 1 T,(a,at)eJ^\^t g(Ldim( J"i ((T, (Tt ) ) , T - t ) CXp { - ALt-1 (o") } exp {- A( 1 - CTt ) } 

A T,(a,at)ej^\^t g(Ldim( (d, (Tf ) ) , T - t ) cxp { - ALj^i (cr) } exp { - A( 1 + CT* ) } 
The resulting value becomes 



l+^{T-t) + ^\og\ .9(Ldim(J-t((T,at)),r-i)exp{-ALt_i(a)}exp{-A(l-at)} 



+ :7^1og| E .9(Ldim(J-,((T,at)),r-i)exp{-Ait_i(a)}exp{-A(l + a()} 



l+^(T-i) + ^E,log] 9{Ldim{Tt{'J,at)),T-t)exp{-XLt-i{a)}exp{-X{l-e<Jt)}\ 



< 1+ ^(T-i) + ^logi Y ff(Ldim(^t(a,(Tt)),r-t)exp{-ALt-i(a)}E,exp{-A(l-eat)} 

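for a Rademacher random variable $\epsilon \in \{\pm1\}$. Now,

$$\mathbb{E}_\epsilon\exp\{-\lambda(1-\epsilon\sigma_t)\} = e^{-\lambda}\,\mathbb{E}_\epsilon e^{\lambda\epsilon\sigma_t} \le e^{-\lambda}e^{\lambda^2/2}$$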
Substituting this into the above expression, we obtain an upper bound of 



^-(r-t+l) + ^log| Y 9{Ldim{Tt{'J,at)),T-t)exp{-XLt-i{a)}\ 



which completes the proof of admissibility using the same combinatorial argument as in the earlier part of 
the proof. 
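
The one analytic fact used in this argument is that $\mathbb{E}_\epsilon\exp\{-\lambda(1-\epsilon\sigma_t)\} = e^{-\lambda}\cosh(\lambda) \le e^{-\lambda}e^{\lambda^2/2}$ for a sign $\sigma_t \in \{\pm1\}$. A short numerical check, with a hypothetical helper name and an arbitrary grid of $\lambda$ values, is:

```python
import math


def rademacher_mgf(lam, sigma):
    """E_eps exp(-lam * (1 - eps * sigma)) for eps uniform on {-1, +1}."""
    return 0.5 * (math.exp(-lam * (1 - sigma)) + math.exp(-lam * (1 + sigma)))


# Compare against exp(-lam) * exp(lam^2 / 2) on a grid of lam in (0, 5].
bound_holds = all(
    rademacher_mgf(lam, sigma) <= math.exp(-lam + lam * lam / 2.0) + 1e-12
    for lam in [0.05 * k for k in range(1, 101)]
    for sigma in (-1, 1)
)
```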



Arriving at the Relaxation. Finally, we show that the relaxation we use arises naturally as an upper
bound on the sequential Rademacher complexity. Fix a tree $\mathbf{x}$. Let $\sigma \in \{\pm1\}^{t-1}$ be a sequence of signs.
Observe that given history $x^t = (x_1,\ldots,x_t)$, the signs $\epsilon \in \{\pm1\}^{T-t}$, and a tree $\mathbf{x}$, the function class $\mathcal{F}$ takes
on only a finite number of possible values $(\sigma,\sigma_t,\omega)$ on $(x^t,\mathbf{x}(\epsilon))$. Here, $\mathbf{x}(\epsilon)$ denotes the sequence of values
along the path $\epsilon$. We have,

r T-t t 
supE, sup]2 ^ ej(xj(e)) - ^1 l/C^^O 



( T-t f 

■yi|}- = supE£ max max < 2 ^ e^w^ - ^ |o-j - j/j] 

X CTt6{±l} ((T,w):(<j,crt,w)e:P|(^t_^(,)) [ j=i 



<supE£ max max max 2 ^ eiVi(e) - ^ |tTi - 



" <7te{±l} a:(a.at)<iy^\^t veV iy^(<T,at) 



T-t 



i=l
where $\mathcal{F}|_{(x^t,\mathbf{x}(\epsilon))}$ is the projection of $\mathcal{F}$ onto $(x^t,\mathbf{x}(\epsilon))$, $\mathcal{F}(\sigma,\sigma_t) = \{f\in\mathcal{F} : f(x^t) = (\sigma,\sigma_t)\}$, and $V(\mathcal{F}(\sigma,\sigma_t),\mathbf{x})$
is the zero-cover of the set $\mathcal{F}(\sigma,\sigma_t)$ on the tree $\mathbf{x}$. We then have the following relaxation:



^logjsupEe XI E E exp|2A ^ e,Vj(e) - ALi(cr,crt)| j 

■((T,crt),x) I 1=1 ) / 



A 



\ X ate{±l}a:{<7.at)eJ^\^tveV{J^{ 

where $L_t(\sigma,\sigma_t) = \sum_{i=1}^{t}|\sigma_i - y_i|$. The latter quantity can be factorized:

ylogjsup ^ ^ exp{-\Lt{a,at)}E, ^ exp |2A ^ e,v,(e) 



<-log(sup E exp{-XLt{a,at)}caYd{V{T{a,at),^))exp{2X\T-t)} 

, X CTte{±l}o-:(o-,o-t)e;!=-|^t ^ 



A 



<^log[ ^ exp{-A|at-2/i|} ^ g{Ldim{T{a,at)) ,T - t) exp {-XLt-^{a)}\ + 2X{T - t) 



A 

This concludes the derivation of the relaxation 



□ 



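Proof of Lemma 11. We first exhibit the proof for the convex loss case. To show admissibility using the
particular randomized strategy $q_t$ given in the lemma, we need to show that

$$\sup_{x_t}\left\{\mathbb{E}_{f\sim q_t}[\ell(f,x_t)] + \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_t)\right\} \le \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_{t-1})$$

The strategy $q_t$ proposed by the lemma is such that we first draw $x_{t+1},\ldots,x_T \sim D$ and $\epsilon_{t+1},\ldots,\epsilon_T$ Rademacher
random variables, and then based on this sample pick $f_t = f_t(x_{t+1:T},\epsilon_{t+1:T})$ as in (22). Hence,

$$\sup_{x_t}\left\{\mathbb{E}_{f\sim q_t}[\ell(f,x_t)] + \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_t)\right\}$$

$$= \sup_{x_t}\left\{\mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\,\ell(f_t,x_t) + \mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right\}$$

$$\le \mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\sup_{x_t}\left\{\ell(f_t,x_t) + \sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right\}$$

where $L_t(f) = \sum_{i=1}^{t}\ell(f,x_i)$. Observe that our strategy "matched the randomness" arising from the relaxation!
Now, with $f_t$ defined as

$$f_t = \arg\min_{g\in\mathcal{F}}\sup_{x_t}\left\{\ell(g,x_t) + \sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right\}$$

for any given $x_{t+1:T},\epsilon_{t+1:T}$, we have

$$\sup_{x_t}\left\{\ell(f_t,x_t) + \sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right\} = \inf_{g\in\mathcal{F}}\sup_{x_t}\left\{\ell(g,x_t) + \sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right\}$$

We can conclude that for this choice of $q_t$,

$$\sup_{x_t}\left\{\mathbb{E}_{f\sim q_t}[\ell(f,x_t)] + \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_t)\right\} \le \mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\inf_{g\in\mathcal{F}}\sup_{x_t}\left\{\ell(g,x_t) + \sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right\}$$

$$= \mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\inf_{g\in\mathcal{F}}\sup_{p_t\in\Delta(\mathcal{X})}\mathbb{E}_{x_t\sim p_t}\left[\ell(g,x_t) + \sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right]$$

$$= \mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\sup_{p\in\Delta(\mathcal{X})}\inf_{g\in\mathcal{F}}\left\{\mathbb{E}_{x_t\sim p}[\ell(g,x_t)] + \mathbb{E}_{x_t\sim p}\sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_t(f)\right]\right\}$$

In the last step we appealed to the minimax theorem, which holds as the loss is convex in $g$, $\mathcal{F}$ is a compact
convex set, and the term in the expectation is linear in $p_t$, as it is an expectation. The last expression can
be written as

$$\mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\sup_{p\in\Delta(\mathcal{X})}\mathbb{E}_{x_t\sim p}\sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_{t-1}(f) + \inf_{g\in\mathcal{F}}\mathbb{E}_{x_t\sim p}[\ell(g,x_t)] - \ell(f,x_t)\right]$$

$$\le \mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\sup_{p\in\Delta(\mathcal{X})}\mathbb{E}_{x_t\sim p}\sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_{t-1}(f) + \mathbb{E}_{x_t\sim p}[\ell(f,x_t)] - \ell(f,x_t)\right]$$

$$\le \mathbb{E}_{\epsilon_{t+1:T},x_{t+1:T}}\mathbb{E}_{x_t\sim D}\mathbb{E}_{\epsilon_t}\sup_{f\in\mathcal{F}}\left[C\sum_{i=t+1}^{T}\epsilon_i\ell(f,x_i) - L_{t-1}(f) + C\epsilon_t\ell(f,x_t)\right]$$

$$= \mathbf{Rel}_T(\mathcal{F}\mid x_1,\ldots,x_{t-1})$$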

Last inequality is by Assumption 1, using which we can replace a draw from supremum over distributions 
by a draw from the "equivalently bad" fixed distribution D by suffering an extra factor of C multiplied to 
that random instance. 

The key step where we needed convexity was to use minimax theorem to swap infimum and supremum 
inside the expectation. In general the minimax theorem need not hold. In the non-convex scenario this is 
the reason we add the extra randomization through qt . The non-convex case has a similar proof except that 
we have expectation w.r.t. qt extra on each round which essentially convexifies our loss and thus allows us 
to appeal to the minimax theorem. □ 

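Proof of Lemma 12. Let $w \in \mathbb{R}^N$ be arbitrary. Throughout this proof, let $\epsilon \in \{\pm1\}$ be a single Rademacher
random variable, rather than a vector. To prove (25), observe that

$$\sup_{p\in\Delta(\mathcal{X})}\mathbb{E}_{x\sim p}\left\|w + \mathbb{E}_{x'\sim p}[x'] - x\right\|_\infty \le \sup_{p\in\Delta(\mathcal{X})}\mathbb{E}_{x,x'\sim p}\|w + x' - x\|_\infty$$

$$= \sup_{p\in\Delta(\mathcal{X})}\mathbb{E}_{x,x'\sim p}\mathbb{E}_\epsilon\|w + \epsilon(x'-x)\|_\infty$$

$$\le \sup_{x,x'\in\mathcal{X}}\mathbb{E}_\epsilon\|w + \epsilon(x'-x)\|_\infty$$

$$\le \sup_{x'\in\mathcal{X}}\mathbb{E}_\epsilon\|w/2 + \epsilon x'\|_\infty + \sup_{x\in\mathcal{X}}\mathbb{E}_\epsilon\|w/2 - \epsilon x\|_\infty$$

$$= \sup_{x\in\mathcal{X}}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 2\epsilon x_i|$$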

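The supremum over $x \in \mathcal{X}$ is achieved at the vertices of $\mathcal{X}$ since the expected maximum is a convex function.
It remains to prove the inequality

$$\max_{x\in\{\pm1\}^N}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 2\epsilon x_i| \le \mathbb{E}_{x\sim D}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 6\epsilon x_i| \qquad (47)$$

Let $i^* = \arg\max_i|w_i|$ and $j^* = \arg\max_{i\ne i^*}|w_i|$ be the coordinates with largest and second-largest magnitude. If
$|w_{i^*}| - |w_{j^*}| > 4$, the statement follows since, for any $x \in \{\pm1\}^N$ and $\epsilon \in \{\pm1\}$,

$$\max_{i\ne i^*}|w_i + 2\epsilon x_i| \le \max_{i\ne i^*}|w_i| + 2 < |w_{i^*}| - 2 \le |w_{i^*} + 2\epsilon x_{i^*}|,$$

and thus

$$\max_{x\in\{\pm1\}^N}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 2\epsilon x_i| = \max_{x\in\{\pm1\}^N}\mathbb{E}_\epsilon|w_{i^*} + 2\epsilon x_{i^*}| = |w_{i^*}| \le \mathbb{E}_{x,\epsilon}|w_{i^*} + 6\epsilon x_{i^*}| \le \mathbb{E}_{x,\epsilon}\max_{i\in[N]}|w_i + 6\epsilon x_i|.$$

It remains to consider the case when $|w_{i^*}| - |w_{j^*}| \le 4$. We have that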

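$$\mathbb{E}_{x,\epsilon}\max_{i\in[N]}|w_i + 6\epsilon x_i| \ge \mathbb{E}_{x,\epsilon}\max_{i\in\{i^*,j^*\}}|w_i + 6\epsilon x_i| \ge \frac{1}{2}(|w_{i^*}|+6) + \frac{1}{4}(|w_{i^*}|-6) + \frac{1}{4}(|w_{j^*}|+6) \ge |w_{i^*}| + 2 \qquad (48)$$

$$\ge \max_{x\in\{\pm1\}^N}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 2\epsilon x_i|, \qquad (49)$$

where 1/2 is the probability that $\epsilon x_{i^*} = \mathrm{sign}(w_{i^*})$, the second event of probability 1/4 is the event that
$\epsilon x_{i^*} \ne \mathrm{sign}(w_{i^*})$ and $\epsilon x_{j^*} \ne \mathrm{sign}(w_{j^*})$, while the third event of probability 1/4 is that $\epsilon x_{i^*} \ne \mathrm{sign}(w_{i^*})$ and
$\epsilon x_{j^*} = \mathrm{sign}(w_{j^*})$. □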
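
Inequality (47) can be sanity-checked by brute force in low dimension. The sketch below assumes that $D$ is the uniform distribution on $\{\pm1\}^N$ (an assumption of this example; the distribution is fixed in the statement of the lemma, which is not reproduced here) and enumerates all sign patterns exactly. Function names are hypothetical.

```python
import itertools
import random


def lhs(w):
    """max over x in {-1,1}^N of E_eps max_i |w_i + 2 eps x_i|."""
    best = 0.0
    for x in itertools.product((-1, 1), repeat=len(w)):
        val = sum(
            max(abs(wi + 2 * eps * xi) for wi, xi in zip(w, x))
            for eps in (-1, 1)
        ) / 2.0
        best = max(best, val)
    return best


def rhs(w, scale=6.0):
    """E over uniform x in {-1,1}^N and eps of max_i |w_i + scale eps x_i|."""
    total, count = 0.0, 0
    for x in itertools.product((-1, 1), repeat=len(w)):
        for eps in (-1, 1):
            total += max(abs(wi + scale * eps * xi) for wi, xi in zip(w, x))
            count += 1
    return total / count


rng = random.Random(0)
trials = [[rng.uniform(-10.0, 10.0) for _ in range(3)] for _ in range(200)]
inequality_holds = all(lhs(w) <= rhs(w) + 1e-9 for w in trials)
```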

Proof of Lemma 13. Let $w \in \mathbb{R}^N$ be arbitrary. Just as in the proof of Lemma 12, we need to show

$$\max_{x\in\{\pm1\}^N}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 2\epsilon x_i| \le \mathbb{E}_{x\sim D}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + C\epsilon x_i| \qquad (50)$$

Let $i^* = \arg\max_i|w_i|$ and $j^* = \arg\max_{i\ne i^*}|w_i|$ be the coordinates with largest and second-largest magnitude.

If $|w_{i^*}| - |w_{j^*}| > 4$, the statement follows exactly as in Lemma 12. It remains to consider the case when
$|w_{i^*}| - |w_{j^*}| \le 4$. In this case first note that

$$\max_{x\in\{\pm1\}^N}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 2\epsilon x_i| \le |w_{i^*}| + 2$$

On the other hand, since the distribution we consider is symmetric, with probability 1/2 its sign is negative
and with the remaining probability positive. Define $\sigma_{i^*} = \mathrm{sign}(x_{i^*})$, $\sigma_{j^*} = \mathrm{sign}(x_{j^*})$, $\tau_{i^*} = \mathrm{sign}(w_{i^*})$, and
$\tau_{j^*} = \mathrm{sign}(w_{j^*})$. Since each coordinate is drawn i.i.d., using conditional expectations we have,



Ex^e max \wi + Cexi\ = max \wi + Cxi\ 

^ Ex [\wi* + Cxi* \ I = Tj*] Ex [\wj* + Cxj*\ I aj* + T,*,aj* = tj*] E[\wi* + Cxi*\ \ Oj* + n*,aj* + Tj-*] 

2 ^ 4 "^4 

^ Ex [\wi*\ + C\xi*\ I aj* = n*] Ex[\w.j*\ + C\xj*\ \ (Jj* +n*,(j.j* =Tj-*] E [\wi* \ - C\xi* \ \ (Jj* + n* , o-j-» + tj, ] 

2 4 "^4 

^ E[\w^,\ + C\x^.\ I (Jj. =T,,] E[\wj*\ + C\xj*\ I crj> = Tj>] E[\w,.\-C\xi. \ \ a,. + t,»] 

2 4 4 

_ K.| + CE[|a;,.| I a,. =T,.] \w,A + CE[\x,.\ \ a,*=T,.] \w,,\-CE[\x,.\ \ a,. + n,] 



2 4 

2\w,. \ + \w,A + 3CE [\x^. I I g,. = t,. ] ^ \w,, \ - CE [\x,. \\ a,. + ] 

4 "^4 
3K.| + K-.| + 2CE[|a;,.| | a,. =t,.] 






Now since we are in the case when \wi* \ - \wj* \ < 4 we see that 

E^^max Wi + Cexi > — ^—^ ' ' — > ' ' 

i ' ' 4 4 

On the other hand, as we already argued,

$$\max_{x\in\{\pm1\}^N}\mathbb{E}_\epsilon\max_{i\in[N]}|w_i + 2\epsilon x_i| \le |w_{i^*}| + 2$$

Hence, as long as 

C E[|x»*| I cTj* = Tj*] - 2 ^ 2 
2 

or, in other words, as long as 

C > 6/E [\xi\ I sign(xj) = sign(t«j)] = [\x\] , 

we have that 

max Ee max Iwi + 2exJ < E^^ ^ max Iwi + CexA . 

a;€{±l}W i6[JV] ' i 

This concludes the proof. □ 
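Lemma 25. Consider the case when $\mathcal{X}$ is the $\ell_\infty$ ball and $\mathcal{F}$ is the $\ell_1$ unit ball. Let $f^* = \arg\min_{f\in\mathcal{F}}\langle f,R\rangle$,
then for any random vector $R$,

$$\mathbb{E}_R\left[\sup_{x\in\mathcal{X}}\left\{\langle f^*,x\rangle + \|R+x\|_\infty\right\}\right] \le \mathbb{E}_R\left[\inf_{f\in\mathcal{F}}\sup_{x\in\mathcal{X}}\left\{\langle f,x\rangle + \|R+x\|_\infty\right\}\right] + 4\,\mathbb{P}\left(\|R\|_\infty \le 4\right)$$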



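Proof. Let $f^* = \arg\min_{f\in\mathcal{F}}\langle f,R\rangle$. We start by noting that for any $f' \in \mathcal{F}$,

$$\sup_{x\in\mathcal{X}}\left\{\langle f',x\rangle + \|R+x\|_\infty\right\} = \sup_{x\in\mathcal{X}}\left\{\langle f',x\rangle + \sup_{f\in\mathcal{F}}\langle f,R+x\rangle\right\}$$

$$= \sup_{f\in\mathcal{F}}\sup_{x\in\mathcal{X}}\left\{\langle f',x\rangle + \langle f,R+x\rangle\right\}$$

$$= \sup_{f\in\mathcal{F}}\left\{\sup_{x\in\mathcal{X}}\langle f'+f,x\rangle + \langle f,R\rangle\right\}$$

$$= \sup_{f\in\mathcal{F}}\left\{\|f'+f\|_1 + \langle f,R\rangle\right\}$$

Hence note that

$$\inf_{f'\in\mathcal{F}}\sup_{x\in\mathcal{X}}\left\{\langle f',x\rangle + \|R+x\|_\infty\right\} = \inf_{f'\in\mathcal{F}}\sup_{f\in\mathcal{F}}\left\{\|f'+f\|_1 + \langle f,R\rangle\right\} \qquad (51)$$

$$\ge \inf_{f'\in\mathcal{F}}\left\{\|f'-f^*\|_1 - \langle f^*,R\rangle\right\} = \inf_{f'\in\mathcal{F}}\left\{\|f'-f^*\|_1 + \|R\|_\infty\right\} = \|R\|_\infty \qquad (52)$$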

On the other hand, note that $f^*$ is a vertex of the $\ell_1$ ball (any one given by $\arg\max_i|R[i]|$, with sign
opposite to the sign of $R[i]$ on that vertex). Since the $\ell_1$ ball is the convex hull of the $2d$ vertices, any vector $f \in \mathcal{F}$
can be written as $f = \alpha h - \beta f^*$ for some $h \in \mathcal{F}$ such that $\|h\|_1 = 1$ and $\langle h,R\rangle = 0$ (which means that $h$ is 0 on the
maximal co-ordinate of $R$ specified by $f^*$) and for some $\beta \in [-1,1]$, $\alpha \in [0,1]$ s.t. $\|\alpha h - \beta f^*\|_1 \le 1$. Further
note that the constraint on $\alpha,\beta$ imposed by requiring that $\|\alpha h - \beta f^*\|_1 \le 1$ can be written as $\alpha + |\beta| \le 1$.
Hence,

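$$
\begin{aligned}
\sup_{x \in \mathcal{X}} \left\{\langle f^*, x\rangle + \|R + x\|_\infty\right\}
&= \sup_{f \in \mathcal{F}} \left\{\|f^* + f\|_1 + \langle f, R\rangle\right\} \\
&= \sup_{\alpha \in [0,1]} \ \sup_{h \perp f^*,\ \|h\|_1 = 1} \ \sup_{\beta \in [-1,1],\ \|\alpha h - \beta f^*\|_1 \le 1} \left\{\|(1 - \beta) f^* + \alpha h\|_1 + \beta \|R\|_\infty + \alpha \langle h, R\rangle\right\} \\
&= \sup_{\alpha \in [0,1]} \ \sup_{h \perp f^*,\ \|h\|_1 = 1} \ \sup_{\beta \in [-1,1],\ \|\alpha h - \beta f^*\|_1 \le 1} \left\{|1 - \beta|\, \|f^*\|_1 + \alpha \|h\|_1 + \beta \|R\|_\infty\right\} \\
&= \sup_{\alpha \in [0,1]} \ \sup_{\beta \in [-1,1]:\ |\beta| + \alpha \le 1} \left\{|1 - \beta| + \alpha + \beta \|R\|_\infty\right\} \\
&\le \sup_{\beta \in [-1,1]} \left\{|1 - \beta| + 1 - |\beta| + \beta \|R\|_\infty\right\} \\
&\le \sup_{\beta \in [-1,1]} \left\{2|1 - \beta| + \beta \|R\|_\infty\right\} \\
&= \sup_{\beta \in \{-1, 1\}} \left\{2|1 - \beta| + \beta \|R\|_\infty\right\} \\
&= \max\left\{\|R\|_\infty,\ 4 - \|R\|_\infty\right\} \\
&\le \|R\|_\infty + 4 \cdot \mathbf{1}\left\{\|R\|_\infty \le 4\right\}.
\end{aligned}
$$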

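Hence, combining with Equation (51), we can conclude that

$$
\begin{aligned}
\mathbb{E}_R\left[\sup_{x \in \mathcal{X}} \left\{\langle f^*, x\rangle + \|R + x\|_\infty\right\}\right]
&\le \mathbb{E}_R\left[\inf_{f \in \mathcal{F}} \sup_{x \in \mathcal{X}} \left\{\langle f, x\rangle + \|R + x\|_\infty\right\}\right] + 4\, \mathbb{E}_R\left[\mathbf{1}\left\{\|R\|_\infty \le 4\right\}\right] \\
&= \mathbb{E}_R\left[\inf_{f \in \mathcal{F}} \sup_{x \in \mathcal{X}} \left\{\langle f, x\rangle + \|R + x\|_\infty\right\}\right] + 4\, P\left(\|R\|_\infty \le 4\right).
\end{aligned}
$$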



□ 
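The closing scalar steps of the chain in the proof of Lemma 25 (the supremum over $\beta \in [-1, 1]$ of $2|1 - \beta| + \beta \|R\|_\infty$, its closed form, and the indicator bound) are easy to check numerically. The following is a small sketch under our own assumptions: `a` stands for $\|R\|_\infty \ge 0$, the function names are hypothetical, and the supremum is approximated on a grid that contains both endpoints. It checks only these scalar steps, not the lemma as a whole.

```python
def sup_over_beta(a, grid=2001):
    """Grid approximation of sup over b in [-1, 1] of 2|1 - b| + b * a."""
    best = float("-inf")
    for k in range(grid):
        b = -1.0 + 2.0 * k / (grid - 1)
        best = max(best, 2.0 * abs(1.0 - b) + b * a)
    return best


def closed_form(a):
    """max{a, 4 - a}: the value attained at b = 1 or b = -1."""
    return max(a, 4.0 - a)


def indicator_bound(a):
    """a + 4 * 1{a <= 4}, the final upper bound in the chain."""
    return a + (4.0 if a <= 4.0 else 0.0)
```

On $[-1, 1]$ the objective is linear in $\beta$ (since $|1 - \beta| = 1 - \beta$ there), which is why only the two endpoints matter.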



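Proof of Lemma 14. On any round $t$, the algorithm draws $\epsilon_{t+1}, \ldots, \epsilon_T$ and $x_{t+1}, \ldots, x_T \sim D^N$ and plays

$$
f_t = \arg\min_{f \in \mathcal{F}} \left\langle f,\ \sum_{i=1}^{t-1} x_i - C \sum_{i=t+1}^{T} \epsilon_i x_i \right\rangle.
$$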



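We shall show that this randomized algorithm is (almost) admissible w.r.t. the relaxation (with some small additional term at each step). We define the relaxation as

$$
\mathbf{Rel}_T\left(\mathcal{F} \mid x_1, \ldots, x_t\right) = \mathbb{E}_{\epsilon_{t+1:T},\ x_{t+1}, \ldots, x_T \sim D^N} \left\| \sum_{i=1}^{t} x_i - C \sum_{i=t+1}^{T} \epsilon_i x_i \right\|_\infty.
$$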

Proceeding just as in the proof of Lemma 11 note that, for our randomized strategy, 
sup {Ef~qt [{f,x)] + RelT (^Ixi, ...,xt)} 

X 

t-l T 



SUp|E^^^^^^^£,iv [(/t,x)] + E^^^^,^^DN 



Y,Xi + x-C Xt 

4=1 i=t+l 



<E 



s^Pi {ft, x) + 



t-l T 

Xi + X — c ^ X, 

i=l i=t+l 



(53)
In view of Lemma 25 (with $R = \sum_{i=1}^{t-1} x_i - C \sum_{i=t+1}^{T} \epsilon_i x_i$) we conclude that



^Xt+l,...,XT 

<E 



sup \{ft,x) + 



Xt + 1,...,XT 



i=l i=t+l 

t-1 



mi sup \ if, x) + 

X 



C* Xj + : 

i=l i=t+l 



■4 P 



E 



Xt + 1....,XT 



snp\(ft,x) + 



^ C Xj + : 



+ 4 P 



t-l T 

^ - C ^ X, 

1=1 i=t+l 
t-l T 
^Xi - C ^ Xi 
i=l i=t+l 



<4 



<4 



where 



ft = argmin sup-^ + 



t-l T 

"Y^Xi - C ^ Xi + : 

i=l i=t+l 



Combining with Equation (53) we conclude that 
sup {E/~5t [(/: 2;)] + Relr {^\xi ,...,xt)} 



<E 



Xt + 1,...,XT 



Now, since 
4 P 

we have 



t-l T 



supN/i*,a;) + 



<4 



t-l T 

^ — C ^ Xi + : 

i=l i=t+l 



+ 4 P 



t-l T 

"Y^Xi - C ^ X, 

j=l i=t+l 



<4 



C 



).4P( 

sup{E/.,, [(/,a;)] +RelT (JP|xi, . . . , Xt)} 



T 



<4 <4P,,^„...,,,.^ C 



E ^4] 

V i=t+l / 



<E 



Xt+1,...,XT 



f-1 

^^Xi - C ^ Xi + : 

1=1 



+ 4 P 



Vt+l, ■■-.Vt-D I 



(c^ E y< ^4] 

\ ■i=t+i / 



(54) 
(55) 



In view of Lemma 13, Assumption 2 is satisfied by with constant C. Further in the proof of Lemma 
11 we already showed that whenever Assumption 2 is satisfied, the randomized strategy specified by is 
admissible. More specifically we showed that 



E 



Xt+1....,XT 



t-l T 

Xi - C ^ Xi + : 



<'RelT{F\xi,...,xt-i) 



and so using this in Equation (54) we conclude that for the randomized strategy in the statement of the 
lemma, 



sup {Ef^q^ [(/, x)] + Relr {T\xi ,...,xt)} 



{c Zv^ ^ 4) 

\ i=t+l / 



< RelT {F\xi,. . .,xt-i) +4 Pyt^,,...^yj,~D 1 

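Or, in other words, the randomized strategy proposed is admissible with an additional additive factor of $4\, P_{y_{t+1}, \ldots, y_T \sim D}\left(C \left|\sum_{i=t+1}^{T} y_i\right| \le 4\right)$ at each time step $t$. Hence by Proposition 1 we have that for the randomized algorithm specified in the lemma,

$$
\begin{aligned}
\mathbb{E}\left[\mathbf{Reg}_T\right]
&\le \mathbf{Rel}_T(\mathcal{F}) + 4 \sum_{t=1}^{T} P_{y_{t+1}, \ldots, y_T \sim D}\left(C \left|\sum_{i=t+1}^{T} y_i\right| \le 4\right) \\
&= C\, \mathbb{E}_{\epsilon,\ x_1, \ldots, x_T \sim D^N} \left\| \sum_{t=1}^{T} \epsilon_t x_t \right\|_\infty + 4 \sum_{t=1}^{T} P_{y_{t+1}, \ldots, y_T \sim D}\left(C \left|\sum_{i=t+1}^{T} y_i\right| \le 4\right).
\end{aligned}
$$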



This concludes the proof. 



□
Proof of Lemma 15. Instead of using $C = 4\sqrt{2}$ and drawing uniformly from the surface of the unit sphere, we can equivalently think of the constant as being 1 and drawing uniformly from the surface of the sphere of radius $4\sqrt{2}$. Let $\|\cdot\|$ stand for the Euclidean norm. To prove (25), first observe that



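$$
\sup_{p \in \Delta(\mathcal{X})} \mathbb{E}_{x_t \sim p} \left\| w + \mathbb{E}_{x \sim p}[x] - x_t \right\| \le \sup_{x \in \mathcal{X}} \mathbb{E}_{\epsilon} \left\| w + 2 \epsilon x \right\| \tag{56}
$$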


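for any $w \in B$. Further, using Jensen's inequality,

$$
\sup_{x \in \mathcal{X}} \mathbb{E}_{\epsilon} \|w + 2\epsilon x\| \le \sup_{x \in \mathcal{X}} \sqrt{\mathbb{E}_{\epsilon} \|w + 2\epsilon x\|^2} \le \sup_{x \in \mathcal{X}} \sqrt{\|w\|^2 + \mathbb{E}_{\epsilon} \|2\epsilon x\|^2} = \sqrt{\|w\|^2 + 4}.
$$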

To prove the lemma, it is then enough to show that for $r = 4\sqrt{2}$,

$$
\mathbb{E}_{x \sim D} \|w + r x\| \ge \sqrt{\|w\|^2 + 4} \tag{57}
$$

for any $w$, where we omitted $\epsilon$ since $D$ is symmetric. This fact can be proved with the following geometric argument.

We define quadruplets $(w + z_1, w + z_2, w - z_1, w - z_2)$ of points on the sphere of radius $r$. Each quadruplet will have the property that

$$
\frac{\|w + z_1\| + \|w + z_2\| + \|w - z_1\| + \|w - z_2\|}{4} \ge \sqrt{\|w\|^2 + 4} \tag{58}
$$

for any $w$. We then argue that the uniform distribution can be decomposed into these quadruplets such that each point on the sphere occurs in only one quadruplet (except for a measure zero set when $z_1$ is aligned with $-w$), thus concluding that (57) holds true.




Figure 1: The two-dimensional construction for the proof of Lemma 15. 

Pick any direction $w^\perp$ perpendicular to $w$. A quadruplet is defined by perpendicular vectors $z_1$ and $z_2$ which have length $r$ and which lie in the plane spanned by $w, w^\perp$. Let $\theta$ be the angle between $-w$ and $z_1$. Since we are now dealing with a two-dimensional plane spanned by $w$ and $w^\perp$, we may as well assume that $w$ is aligned with the positive $x$-axis, as in Figure 1. We write $w$ for $\|w\|$. The coordinates of the quadruplet are

$$
(w - r\cos\theta,\ r\sin\theta), \quad (w + r\cos\theta,\ -r\sin\theta), \quad (w + r\sin\theta,\ r\cos\theta), \quad (w - r\sin\theta,\ -r\cos\theta).
$$

For brevity, let $s = \sin(\theta)$, $c = \cos(\theta)$. The desired inequality (58) then reads

$$
\sqrt{w^2 - 2rwc + r^2} + \sqrt{w^2 + 2rwc + r^2} + \sqrt{w^2 + 2rws + r^2} + \sqrt{w^2 - 2rws + r^2} \ge 4\sqrt{w^2 + 4}.
$$

To prove that this inequality holds, we square both sides, keeping in mind that the terms are non-negative. The sum of four squares on the left-hand side gives $4w^2 + 4r^2$. For the six cross terms, we can pass to a lower bound by replacing $r^2$ in each square root by $r^2c^2$ or $r^2s^2$, whichever completes the square. Then observe that

$$
|w + rs| \cdot |w - rs| + |w + rc| \cdot |w - rc| \ge 2w^2 - r^2,
$$

while the other four cross terms satisfy

$$
\left(|w + rs| \cdot |w - rc| + |w + rs| \cdot |w + rc|\right) + \left(|w - rs| \cdot |w + rc| + |w - rs| \cdot |w - rc|\right) \ge |w + rs| \cdot 2w + |w - rs| \cdot 2w \ge 4w^2.
$$

Doubling the cross terms gives a contribution of $2(6w^2 - r^2)$, while the sum of squares yielded $4w^2 + 4r^2$. The desired inequality is satisfied as long as $16w^2 + 2r^2 \ge 16(w^2 + 4)$, or $r \ge 4\sqrt{2}$.
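The quadruplet inequality (58) can also be probed numerically on a grid of lengths $w$ and angles $\theta$. The sketch below is an illustration only, under our own assumptions: the function names, the grid sizes and the range of $w$ are arbitrary choices, and a grid check is of course no substitute for the squaring argument above.

```python
import math

R_SPHERE = 4.0 * math.sqrt(2.0)  # the radius r = 4*sqrt(2) used in the proof


def quadruplet_sum(w, theta, r=R_SPHERE):
    """Sum of the Euclidean norms of the four quadruplet points, with w on the positive x-axis."""
    c, s = math.cos(theta), math.sin(theta)
    points = [
        (w - r * c, r * s),
        (w + r * c, -r * s),
        (w + r * s, r * c),
        (w - r * s, -r * c),
    ]
    return sum(math.hypot(px, py) for px, py in points)


def min_slack(r=R_SPHERE, w_max=50.0, n_w=201, n_theta=181):
    """Smallest value of (sum of the four norms) - 4*sqrt(w^2 + 4) over the grid."""
    worst = float("inf")
    for i in range(n_w):
        w = w_max * i / (n_w - 1)
        target = 4.0 * math.sqrt(w * w + 4.0)
        for j in range(n_theta):
            theta = 2.0 * math.pi * j / (n_theta - 1)
            worst = min(worst, quadruplet_sum(w, theta, r) - target)
    return worst
```

With the default radius the slack should stay non-negative on the whole grid, whereas a much smaller radius such as `r=1.0` already violates the inequality at `w = 0`, where the left-hand side equals $4r$ and the right-hand side equals $8$.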

□ 

Proof of Lemma 16. By Lemma 15, Assumption 2 is satisfied by distribution $D$ with constant $C = 4\sqrt{2}$. Hence by Lemma 13 we can conclude that for the randomized algorithm which at round $t$ freshly draws $x_{t+1}, \ldots, x_T \sim D$ and picks



/(* = argmin sup 



|(/,^) + 



(we dropped the e's as the distribution is symmetric to start with) the expected regret is bounded as 



E[Reg^]<4x/2 E,,,...,,^.^ 





T 












t=i 


2- 



< 4\/2T 



We claim that the strategy specified in the lemma that chooses 

-E-:Jx. + 4x/2EL,i: 



V^||-E-:i'2;. + 4x/2Ef=,^ie,a;,|' + l 

is the same as choosing fl . To see this let us start by defining 

t-i T 
Xt = + 4\/2 Y, 



Now note that 



/* = argmin sup-^ (/,a;) + 



argmin sup{(/,a;) + \\xt - xW^] 
argmin sup ] (/, x) + A7||it - a;||; 

f^T xnX V 



argmin sup 

/e:^ x:\\x\\^<l 



argmm sup 



(/, x) + \J\\xtf -2{xux) + \\x\\ 



(/, x) + \J\\xtt -2{xt,x) 



+ 1 



However this argmin calculation is identical to the one in the proof of Proposition 4 (with C = 1 and T-t = 0) 
and the solution is given by 



ft - ft 



\] II - ZLl X^ + 4^2 Z^t+l eiX^ II 2 + 1 



Thus we conclude the proof. 
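The final argmin computation in the proof of Lemma 16 can be illustrated numerically. The sketch below rests on our own reading of the garbled displays, which is an assumption: the objective is taken to be $\langle f, x\rangle + \sqrt{\|\tilde{x}\|^2 - 2\langle \tilde{x}, x\rangle + 1}$ over $\|x\|_2 \le 1$, and the candidate solution is $f = \tilde{x} / \sqrt{\|\tilde{x}\|^2 + 1}$. Here `x_tilde` stands for the vector $\tilde{x}_t$ defined in the proof, and all function names are hypothetical.

```python
import math
import random


def closed_form_f(x_tilde):
    """Candidate minimizer x_tilde / sqrt(||x_tilde||^2 + 1) (our reading of the solution)."""
    scale = math.sqrt(sum(v * v for v in x_tilde) + 1.0)
    return [v / scale for v in x_tilde]


def objective(f, x, x_tilde):
    """<f, x> + sqrt(||x_tilde||^2 - 2 <x_tilde, x> + 1), the expression inside the supremum."""
    inner_fx = sum(a * b for a, b in zip(f, x))
    inner_xx = sum(a * b for a, b in zip(x_tilde, x))
    under_root = sum(v * v for v in x_tilde) - 2.0 * inner_xx + 1.0
    return inner_fx + math.sqrt(max(under_root, 0.0))


def random_unit_ball_point(dim, rng):
    """Uniform draw from the Euclidean unit ball via a normalized Gaussian direction."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / dim)
    return [radius * v / norm for v in g]


def sampled_sup(f, x_tilde, samples=2000, seed=0):
    """Largest objective value over random points of the unit ball (a lower estimate of the sup)."""
    rng = random.Random(seed)
    return max(objective(f, random_unit_ball_point(len(x_tilde), rng), x_tilde)
               for _ in range(samples))
```

For the closed-form `f`, the objective depends on `x` only through $s = \langle \tilde{x}, x\rangle$ and is concave in $s$ with maximum at $s = 0$, so the sampled supremum should never exceed $\sqrt{\|\tilde{x}\|^2 + 1}$, and $x = 0$ attains that value.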



□
Proof of Lemma 17. We first prove the statement for the convex case. To show admissibility using the particular randomized strategy given in the lemma, we need to show that for the randomized strategy specified by $q_t$,

$$
\sup_{y_t} \left\{\mathbb{E}_{\hat{y}_t \sim q_t}\left[\ell(\hat{y}_t, y_t)\right] + \mathbf{Rel}_T\left(\mathcal{F} \mid (x_1, y_1), \ldots, (x_t, y_t)\right)\right\} \le \mathbf{Rel}_T\left(\mathcal{F} \mid (x_1, y_1), \ldots, (x_{t-1}, y_{t-1})\right)
$$

for any $x_t$. The strategy $q_t$ proposed by the lemma is such that we first draw $(x_{t+1}, y_{t+1}), \ldots, (x_T, y_T) \sim D$ and $\epsilon_{t+1}, \ldots, \epsilon_T$ Rademacher random variables, and then based on this sample pick $\hat{y}_t = \hat{y}_t(x_{t+1:T}, y_{t+1:T}, \epsilon_{t+1:T})$ as in (28). Hence,

sup{Eji,.,, [eiyt,yt)] + Relr iJ'\ixi,yi), . . . ,{xt,yt))} 



sup 

yt 



E t{yt,yt)+ E sup 

t ( = t + l:T .Bt + 1:T) 



< E sup-^^(?;t,2/t) + sup 



("^t+l:Ti!'t+l:r) 



C ed{f{x^),y,)-Lt{f) 



Now, with yt in (28), 



sup-^^(?/t,yt) + sup 



C E ed{f{x,),y.)-Lt{f) 



i=t+i 



M-B,B] 



yt 



inf supU(y,2/t) + sup 



C eAf{^^),y^)-Lt{f) 



ye[-B,B] 



C E edU{x,).y^)-Lt{f) 



i=t+i 



Now we assume that the loss $\ell(\hat{y}, y)$ is convex in the first argument (and bounded). Note that the term



E 



yt~pt 



i(y,yt)+ sup \C Y eAf{^^),y^)-Lt{f) 



is linear in $p_t$ and, due to convexity of the loss, is convex in $\hat{y}_t$. Hence by the minimax theorem, for this choice of $q_t$, we conclude that

sup{Eg,.q, [e{yt, yt)] + Relr {T\{xi,yi), {xt,yt))} 

Vt 



E inf supEyj^pj 



Kyt,yt) + sup 



E 

>yt+i)---(^T'yT) 



Pt yte[-B,B] 



Ac Y ed{f{x,),y^)-Lt{f)\ 

{ i=t+l 

<\c Y ed{f{x^),y^)-Lt{f)\ 

I i=t+l 

The last step above is due to the minimax theorem as the loss is convex in yt, the set [-B,B] is compact. 



^{yt,yt) +sup^
r T t-i



and the term is linear in pt . The above expression is equal to 

^yt~Pt [Kyt,yt)]-i{f{xt),yt) 

3Ey,.p,supic e^^(/(a;^),y^) -Lt-i(/) + inf E^^.p, [e(g(xt),yt)] - e{f(xt), 
f^^ [ i=t+i 

Ac ^ e,£(/(xO,yO-it-i(/) + Ey*~P. W/(^t),yt)]-^(/(^t),yt) 

■ I i=t+l 

)jc ^ e,£(/(x,),2/,;)-it-i(/)+CQ^(/(xO,yt) 



< ESUpjCyj^p^ _ ^ 

Pt /e^ I i=t+l 



< EsupEyj^pj sup _ ^ 
Pt f^y^ [ i=t+i 

T 

^ E E(^^,,y^)^£,E,, sup 

€t + l:T -^-'^ 
(==f+l-!'t+l).- -'(^T-Wt) 



f^J" { i=t+i 
= RelT {T\{xi,yi),. . . , {xt-i,yt-i)) 



The second part of the Lemma is proved analogously. 



□ 



Proof of Lemma 18. Now let $q_t$ be the randomized strategy where we draw $\epsilon_{t+1}, \ldots, \epsilon_T$ uniformly at random and pick



qt{e) = argmin sup i E/^^, [^(/f, Xj)] + sup 



2 X e,^(/,x*(e))-X£(/,xO 



t 

i=l 



(59) 



With the definition of $x^*$ in (30), and with the notation $L_t(f) = \sum_{i=1}^{t} \ell(f, x_i)$,



sup]E/j^,^ [eift,xt)] +supE, sup 



2 X £,£(/, x,;(e))-L,(/) 



sup E, [E;^.,^(,) [£(/t, Xt)]] + E, sup 



2 X e,^(/,xK6))-it(/) 



<E, 



E, 



sup]e^,^,^(^) [£(/t,a;t)] + sup 

Xt [ 

l^ft-qt [e{ft,xt)] + sni, 



sup. 



2 X e,£(/,x*(e))-Lt(/) 



2 X e,^(/,x*(e))-it(/) 
i=t+i 



where the last step is due to the way we pick our predictor $f_t(\epsilon)$ given the random draw of $\epsilon$'s in Equation (59). We now apply the minimax theorem, yielding the following upper bound on the term above:



E, 



sup inf \ E^r^^pj [£(/t, Xt)] + Ext~pt sup 

.Pt^A{X) ft^^ [ fej^ 



2 X e,£{f,4{e))-Lt{f) 



This expression can be re-written as 



E, 


sup 


E^,.p, sup 









< E, 


sup 


l^xt.x'-pt^et sup 




.PteA{X) 





2 Y, ed{f,^,{e))-Lt-i{f) + Ex,.p,[e{f,xt)]-£{f,xt) 



2 Y eAf,<{^)) - Lt-i{f) + et{i{f,xt) - e{f,xt))
By passing to the supremum over $x_t, x'_t$, we get an upper bound



sup < sup 



sup \ Egj sup 



< SUpEe 



< sup Ee sup 









sup 




sup 


XtdX 








T 





i=t+l 

2 £ ed{f,^l{e))-Lt^i{f) + 2et£{f,xt) 

i=t+l 

2 f e,^(/,i,(e))-it_i(/)+2et^(/,a;t) 



i=t+i 



2^Q^(/,x,(e))-Lt_i(/) 



□ 



Proof of Lemma 19. We shall start by showing that the relaxation is admissible for the game where we pick prediction $\hat{y}_t$ and the adversary then directly picks the gradient $\partial\ell(\hat{y}_t, y_t)$. To this end note that

inf sup {di{yt,yt)-yt + 'RelT{J^\d£{yi,yi),...,d£{yt,yt))} 

de{yt,vt) 

sup2L ^ eJ[t]-Y,d£{y,,y,)-f[i] 

S<=T i=t+l i=l 



inf sup \ d£{yt,yt) ■yt+'^c 



< inf sup Irfiit + ^e 



sup2i ^^M-Lt-iU)-rfm 



Let us use the notation $L_{t-1}(f) = \sum_{i=1}^{t-1} \partial\ell(\hat{y}_i, y_i) \cdot f[i]$ for the present proof. The supremum over $r_t \in [-L, L]$ is achieved at the endpoints since the expression is convex in $r_t$. Therefore, the last expression is equal to



inf sup irt-yt+K^ sup 

yt rte{-L,L} [ ' fej^ 



inf sup ^rt~pt 

yt pteAi{-LX}) 

sup infErt~pj 

pt6A({-L,L}) yt 



2L ^ eJ[t]-Lt^^{f)-rff[t] 

i=t+l 

2L £ e,f[t]-Lt-i{f)-rff[t] 

i=t+l 

2L ^ e,;/M-Lt_i(/)-rf/[t] 



rfyt+ Ee sup 



rfyt+ Ee sup 



where the last step is due to the minimax theorem. The last quantity is equal to 



sup Ef E. 

pteA{{-L,L}) 

< sup Ej 

pteAi{-L.L}) 



'rt~Pt 



inf E^t-pj [rt] ■ yt + sup 2L V 
yt \ 



d\t\-U-xU)-rff\t\ 



i=t+l 



)]] 



E 



rt~Pt 



\2L Z 



sup\ -ZL 2, ej[t] - Lt-i{f) + {Er,~pAn] - rt) ■ f[t] 
fer \ i=t+i I 



< sup Krt,r'^~pt 

pteA{{-L,L}) 

sup Ert,r't~pt 
pteA{{-L,L}) 



Ee sup 

Eg sup 



2L eJ[t]-Lt-i{f) + {r't-rt)-f[t] 

i=t+i 

2L Y eJ[t]-Lt^,{f)+et{r't-rt)-f[t] 
i=t+i 



By passing to the worst-case choice of $r_t, r'_t$ (which is achieved at the endpoints because of convexity), we obtain a further upper bound



sup Ee sup 
< sup Eg sup 

rte{L-L} fej^ 

sup Eg sup 

rt€{L,-L} fej^ 



2L eJ[t]-L,^,{f) + et{r[-n)-m 

i=t+l 

2i E eJM-it-i(/) + 2etrf/M 



2L^e,/[t]-Li_i(/) 



Thus we see that the relaxation is admissible. Now the corresponding prediction is given by 

sup|2L £ eJ[z]-'Y^d£{y,,y,)m-nm 



jjt = argmin sup -{ r^y + E^ 

y rti[-L,L] 



argmin sup -j rtjj + Eg 

y rti[-L,L] 



i=t+l 
T 



i=l 

t-l 



J^T \ i=t+l i=l 



argmin sup -j r^y + Ej 

y rte{-L,L} 



sup 



Ul ^ eJ[q-''Zde{y,,y.)m-rtm 



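The last step holds because the term inside the supremum over $r_t$ is convex in $r_t$, and so the supremum is attained at the endpoints of the interval. The $\hat{y}_t$ above is attained when both terms of the supremum are equalized; that is, $\hat{y}_t$ is the prediction that satisfies

$$
\hat{y}_t = \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \left\{\sum_{i=t+1}^{T} \epsilon_i f[i] - \frac{1}{2L} \sum_{i=1}^{t-1} \partial\ell(\hat{y}_i, y_i) f[i] + \frac{1}{2} f[t]\right\} - \sup_{f \in \mathcal{F}} \left\{\sum_{i=t+1}^{T} \epsilon_i f[i] - \frac{1}{2L} \sum_{i=1}^{t-1} \partial\ell(\hat{y}_i, y_i) f[i] - \frac{1}{2} f[t]\right\}\right].
$$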



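Finally, since the relaxation is admissible, we can conclude that the regret of the algorithm is bounded as

$$
\mathbf{Reg}_T \le \mathbf{Rel}_T(\mathcal{F}) = 2L\, \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f[t]\right].
$$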



This concludes the proof. 
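For a small finite class, the equalizing prediction from the proof of Lemma 19 can be computed by brute force over all sign patterns. The sketch below is a minimal illustration under our own assumptions: `F` is a hypothetical finite list of length-`T` tuples standing in for the class of sequences $f[1], \ldots, f[T]$, `grads` holds the past gradients $\partial\ell(\hat{y}_i, y_i)$ for rounds $1, \ldots, t-1$, and the expectation over $\epsilon$ is an exact average, which is feasible only for small $T - t$.

```python
import itertools


def endpoint_value(F, grads, t, T, L, r):
    """E_eps sup_f { 2L * sum_{i>t} eps_i f[i] - sum_{i<t} grads_i f[i] - r * f[t] }.

    Rounds are 1-indexed in the text, so round i is stored at list index i - 1.
    """
    sign_patterns = list(itertools.product((-1.0, 1.0), repeat=T - t))
    total = 0.0
    for eps in sign_patterns:
        total += max(
            2.0 * L * sum(e * f[t + k] for k, e in enumerate(eps))
            - sum(g * f[i] for i, g in enumerate(grads))
            - r * f[t - 1]
            for f in F
        )
    return total / len(sign_patterns)


def equalizing_prediction(F, grads, t, T, L):
    """The y that equalizes the two endpoint terms r = L and r = -L."""
    plus = endpoint_value(F, grads, t, T, L, -L)   # adversary plays r_t = -L
    minus = endpoint_value(F, grads, t, T, L, L)   # adversary plays r_t = +L
    return (plus - minus) / (2.0 * L)
```

Dividing both suprema by $2L$ recovers the form with the factors $\frac{1}{2L}$ and $\frac{1}{2}$ that appears in the proof. A quick usage check is to confirm that $L \hat{y}_t$ plus the `r = L` value equals $-L \hat{y}_t$ plus the `r = -L` value.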



□ 



Proof of Lemma 20. The proof is similar to that of Lemma 19, with a few more twists. We want to 
establish admissibility of the relaxation given in (35) w.r.t. the randomized strategy qt we provided. To this 
end note that 



sup]Ey,.g, [l{yt,yt)]+Kc 



yt 



= sup 

yt 

<E, 



i=t+l 

|EeWyi(e),2/t)]+EJsup|2i ^ 



sup|2L ^^m-Lt{f) 



tM-LtU) 



i=t+l 



sup|€(yt(e),yt)+sup|2L f eJ[i]-Lt{f) 

. Vt [ S^^ I i=t+l 



by Jensen's inequality, with the usual notation $L_t(f) = \sum_{i=1}^{t} \ell(f[i], y_i)$. Further, by convexity of the loss, we may pass to the upper bound



E, 



sup 
yt ( 



lde{yt{^),yt)yt{^)+ sup Ul ^ 



J[i]-Lt-i{f)-d£{yt{e),yt)f[t] 



i=t+l 



<E, 



sup-^E^t [n-ytie)] +sup 



yt 



e./W-it-i(/)-E., [n-flt]]
where $r_t$ is a $\{\pm L\}$-valued random variable with the mean $\partial\ell(\hat{y}_t(\epsilon), y_t)$. With the help of Jensen's inequality, and passing to the worst-case $r_t$ (observe that this is legal for any given $\epsilon$), we have an upper bound



4. 



Vt { 



T 



sup 



{2L f eJ[i]-Lt^,{f)-rfm 



sup I rt ■ yt{e) + sup \2L ^ eJ[i]-Lt-i{f)-rff[t] 



I 



(60) 



Now the strategy we defined is 



yt(e) = argmin sup -! ■ yt(e) + sup \ 2L Y 

I 



yt rte{±L} 

which can be re-written as 



^ e»/W-E^(/W,2/0-^f/W 

i=t+l 1=1 



(supj ^ e,m-^Lt-^if) + lm]-sup{ z ^./w-:^i*-i(/)- J/w|) 



By this choice of $\hat{y}_t(\epsilon)$, plugging back in Equation (60) we see that

T 



sup^^Ej^^.,^ [£(j/t,yt)]+E, 

Vt I 



sup 



<E, 



E, 



E, 



rtE{±L} 



I2L f eJM-Lt(/) 

T 



sup |rfyt(e) + sup|2L ^ W - it-i(/) - ■ /[i] 



inf sup \rfyt + sup 



e^m-Lt-.in-rt-m 



inf sup E^j^pj J r4 ■ j/f + sup \ 2L ^ " Lt-i 



Vt pteA{{±L}) 



The expression inside the supremum is linear in $p_t$, as it is an expectation. Also note that the term is convex in $\hat{y}_t$, and the domain $\hat{y}_t \in \left[-\sup_{f \in \mathcal{F}} |f[t]|,\ \sup_{f \in \mathcal{F}} |f[t]|\right]$ is a bounded interval (hence, compact). We conclude that we can use the minimax theorem, yielding



E, 



sup infE^j^pj 

LpteA({±L}) yt 



E, 



E, 



< E, 



sup < inf Er^^pj [rt ■ ijt] + ^n-pt 
.PteA({±L}) [ yt 



sup \Ert~Pt 
,PteA{{±L}) 



rfyt+sup\2L ^ [i] - it-i(/) - ■ /[i] 

fej' { i=t+l 

snpUl E eJ[i]-Lt^i{f)-rfm 

fej' [ i=t+l 

supjinfE^^.pJrf ■yt] + 2L ^ eJ[i]-Lt-i{f)-rff[t] 

I yt i=t+i 



sup \Krt~Pt 
,PteA{{±L}) 



sup 



lEr,.pArfm] + 2L E eJ[i]-Lt-i{f)-rfm 



In the last step, we replaced the infimum over $\hat{y}_t$ with $f[t]$, only increasing the quantity. Introducing an i.i.d. copy $r'_t$ of $r_t$,



= E, 
<E, 



sup -^E^.^p, 

,pteA{{±L}) 



sup 



sup lEr^ yp^ 
_pteA({±L}) 



[2L E e,f[i]-Lt-^U) + {Er,.pArt]-rt)-m 
i=t+i 

T 



sup 



E Q/W-it-i(/) + (r-;-rt)-/[t] 



i=t+l
Introducing the random sign $\epsilon_t$ and passing to the supremum over $r_t, r'_t$ yields the upper bound

T 





sup 




sup] 




_PteA{{±L}) 







<E, 



sup -j 

rt,r;E{±L} 



sup|2L £ eJ[i]-Lt-i{f)+et{r[-n)-m 



sup -j Efj 

rt,r;E{±i} 



i=t+l 
T 



sup 



sup \ 

rt,rJe{±L} 



2 

T 

I 

i=t+l 



supji J eJW-^it_i(/)-etrf/[i] 



In the above we split the term in the supremum as the sum of two terms, one involving $r_t$ and the other $r'_t$ (other terms are equally split by dividing by 2), yielding



E, 



sup \ Eej 



sup 



e,/[z]-Lt_i(/) + 2et n- f[t\ 



The above step used the fact that the first term only involved $r'_t$ and the second only $r_t$, and further that $\epsilon_t$ and $-\epsilon_t$ have the same distribution. Now, finally noting that irrespective of whether $r_t$ in the above supremum is $L$ or $-L$, since it is multiplied by $\epsilon_t$ we obtain an upper bound



E, 



sup 



We conclude that the relaxation 

RgIt ...,?;*) = E, 



sup|2i E e./W-it(/) 



is admissible and further the randomized strategy where on each round we first draw e's and then set 
is an admissible strategy. Hence, the expected regret under the strategy is bounded as 



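$$
\mathbb{E}\left[\mathbf{Reg}_T\right] \le \mathbf{Rel}_T(\mathcal{F}) = 2L\, \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f[t]\right],
$$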



which concludes the proof. 



□ 



Proof of Lemma 23. The proof is almost identical to the proof of admissibility for the Mirror Descent 
relaxation, so let us only point out the differences. Let Xt^i = ^'^^ Mt-i ~ ^rx^t-i- Using the fact that 

Xt is (Tt-close to /it-i, we expand 



Xt-1 



,xt - fit-i) + c 



E 

5=t+l / 



1/2
As before, pick xt = (3xt-i + "fy for some y e Kernel(V||it_ip). The above expression under the square root
then becomes 



and the only difference from the expression in (40) is that we have a /3' instead of /3 under the square root. 
Taking the derivatives, we see that 



forces /?' = and we conclude admissibility. 



Arriving at the Relaxation We upper bound the sequential Rademacher complexity as 
2 



sup Ef 
a (x,x')6r 



sup I f, 



< + - sup 



a A 



(x.x') 



T / 1 ''-I \ * 

s=t+l V r=t+l / r=l 

T / 1^-1 \ * 

^ eJx,(e)-— Xri^M-Y^r 

i=i+l \ St .r=t+l / r=l 




T / 1 '^-l \ * 

i=t+l V ^ ^ r=t+l / r=\ 



+ sup c Y 

(x,x') s=t+l 



1 ''-I 
5 r ^ — + I 1 



Since (x,x') e T are pairs of tree such that for any e e {±1}"'" and any t e [T]. 

C(a;i,...,a;t,xi(ei),...,xt-i(et_i),xt(e)) = 1 
we can conclude that for any e e {±1}"^ and any t e [T], 

1 *-i 



Xt(e)- — -r Exr(er) 

I -1- r=l 



Proof of Lemma 24. The sequential Rademacher complexity can be upper bounded as



SUpEe 



T-t 



Y^t+Y e»x,(e) 

1=1 i=l 



< sup I^Ee 

< sup I Ee 



t T-t 

i=l 1=1 

t T-t 

i=l i=l 



■CEE.[l|x.(e)r] 

1=1 



(61) 
(62) 

(63) 



□ 



i/p 



+ C(T-i) 



(E + I |e . + C(T - t + 1) j 



1/p
and admissibility is verified in a similar way to the 2-smooth case in Section 3. Here we instead use $p$-smoothness, which follows from the result in [In]. The form of the update specified by the relaxation in this case follows exactly the proof of Proposition 4, yielding

It 



□ 

Lemma 26. The regret upper bound 

T T T m ki 

Y.^UuXt)-i^iTlU.Xt)<T.^Uu^t)-T. inf ^ E ^U.^t) ■ (64) 

is valid. 

Proof of Lemma 26. To prove this inequality, it is enough to show that it holds for subdividing $T$ into two blocks $k_1$ and $k_2$. Observe that the comparator term becomes only smaller if we pass to two instead of one infima, but we must check that no function $f$ that minimizes the loss over the first block is removed from
being a potential minimizer over the second block. This is exactly the definition of T^'^{xi, . . . □ 
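The first observation in this proof, that passing from one infimum to a sum of block-wise infima can only decrease the comparator term, is easy to see on a toy example. The sketch below is an illustration under our own assumptions: it uses a finite set of comparators indexed by columns of a hypothetical loss table, and it takes the unrestricted infimum in every block, so it does not model the restricted classes that Lemma 26 actually uses for the later blocks.

```python
def global_min_loss(losses):
    """min over comparators k of sum_t losses[t][k]."""
    n_comparators = len(losses[0])
    return min(sum(row[k] for row in losses) for k in range(n_comparators))


def blockwise_min_loss(losses, block_sizes):
    """Sum over consecutive blocks of the best comparator's loss within each block."""
    total, start = 0.0, 0
    for size in block_sizes:
        total += global_min_loss(losses[start:start + size])
        start += size
    return total
```

For instance, with `losses = [[1, 0], [1, 0], [0, 1], [0, 1]]` and two blocks of length 2, each block has a comparator with zero loss, while every single comparator suffers loss 2 over the whole sequence.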

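Lemma 27. The relaxation

$$
\mathbf{Rel}_T\left(\mathcal{F} \mid x_1, \ldots, x_t\right) = -\inf_{f \in \mathcal{F}} \sum_{i=1}^{t} x_i(f) + (T - t) \inf_{f \in \mathcal{F}} \sup_{f' \in \mathcal{F}} \|f - f'\|
$$

is admissible.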

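Proof of Lemma 27. First,

$$
\mathbf{Rel}_T\left(\mathcal{F} \mid x_1, \ldots, x_T\right) = -\inf_{f \in \mathcal{F}} \sum_{t=1}^{T} x_t(f).
$$

As for admissibility,

$$
\begin{aligned}
&\inf_{f_t \in \mathcal{F}} \sup_{x} \left\{x(f_t) + \mathbf{Rel}_T\left(\mathcal{F} \mid x_1, \ldots, x_{t-1}, x\right)\right\} \\
&\quad= \inf_{f_t \in \mathcal{F}} \sup_{x} \left\{x(f_t) - \inf_{f \in \mathcal{F}} \left(\sum_{i=1}^{t-1} x_i(f) + x(f)\right)\right\} + (T - t) \inf_{f \in \mathcal{F}} \sup_{f' \in \mathcal{F}} \|f - f'\| \\
&\quad\le \inf_{f_t \in \mathcal{F}} \sup_{x} \left\{x(f_t) - \inf_{f \in \mathcal{F}} x(f) - \inf_{f \in \mathcal{F}} \sum_{i=1}^{t-1} x_i(f)\right\} + (T - t) \inf_{f \in \mathcal{F}} \sup_{f' \in \mathcal{F}} \|f - f'\| \\
&\quad\le \inf_{f_t \in \mathcal{F}} \sup_{x} \left\{\sup_{f \in \mathcal{F}} \langle \nabla x, f_t - f\rangle - \inf_{f \in \mathcal{F}} \sum_{i=1}^{t-1} x_i(f)\right\} + (T - t) \inf_{f \in \mathcal{F}} \sup_{f' \in \mathcal{F}} \|f - f'\| \\
&\quad\le \inf_{f_t \in \mathcal{F}} \left(\sup_{f \in \mathcal{F}} \|f_t - f\| - \inf_{f \in \mathcal{F}} \sum_{i=1}^{t-1} x_i(f)\right) + (T - t) \inf_{f \in \mathcal{F}} \sup_{f' \in \mathcal{F}} \|f - f'\| \\
&\quad= \mathbf{Rel}_T\left(\mathcal{F} \mid x_1, \ldots, x_{t-1}\right).
\end{aligned}
$$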



□